David Arroyo Cazorla created SPARK-22457:
--------------------------------------------
Summary: Tables are classified as MANAGED based only on whether a
path is provided
Key: SPARK-22457
URL: https://issues.apache.org/jira/browse/SPARK-22457
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: David Arroyo Cazorla
As far as I know, since Spark 2.2, tables are classified as MANAGED based only
on whether a path is provided:
{code:scala}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}
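For context, this is how the classification plays out for a non-filesystem
source. A minimal reproduction sketch, assuming a local elasticsearch cluster;
the option names follow elasticsearch-hadoop conventions, but the values are
placeholders:
{code:scala}
// No path/LOCATION is given, so the table falls into the MANAGED branch
// above, even though elasticsearch is not a filesystem-based source.
spark.sql(
  """CREATE TABLE clients
    |USING org.elasticsearch.spark.sql
    |OPTIONS (es.resource 'shop/clients', es.nodes 'localhost')""".stripMargin)
{code}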
This approach seems right for filesystem-based data sources. However, when
working with other data sources such as elasticsearch, it leads to the odd
behaviour described below.
1) InMemoryCatalog's doCreateTable() assigns a default locationUri when the
table is CatalogTableType.MANAGED and tableDefinition.storage.locationUri.isEmpty.
2) Before loading the data source table, FindDataSourceTable's
readDataSourceTable() adds a path option if locationURI exists:
{code:scala}
val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
{code}
3) That causes an error when reading from elasticsearch, because 'path' is an
option already supported by elasticsearch (locationUri is set to
file:/home/user/spark-rv/elasticsearch/shop/clients); a simplified sketch of
this chain follows below:
{noformat}
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required
before using Spark SQL
{noformat}
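To make the interaction concrete, here is a minimal sketch of the chain,
assuming the default warehouse location from the report; the snippets are
simplified illustrations of the cited code paths, not verbatim Spark source:
{code:scala}
import java.net.URI

// 1) doCreateTable() fills in a default location for the MANAGED table
//    (value taken from the report above):
val locationUri: Option[URI] =
  Some(new URI("file:/home/user/spark-rv/elasticsearch/shop/clients"))

// 2) readDataSourceTable() converts that location into a "path" option:
val pathOption: Option[(String, String)] = locationUri.map("path" -> _.toString)

// 3) elasticsearch-hadoop receives path=file:/... and treats it as an ES
//    resource, failing with the EsHadoopIllegalArgumentException shown above.
{code}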
Would it be possible to mark tables as MANAGED only for a subset of data sources
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or to consider some other solution?
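One possible shape for that restriction, as a hedged sketch only; the
managedCapableSources whitelist and the provider value are assumptions for
illustration, not existing Spark API:
{code:scala}
// Sketch: classify as MANAGED only when the provider is in a whitelist of
// sources for which Spark actually manages storage; everything else is EXTERNAL.
val managedCapableSources = Set("text", "csv", "json", "jdbc", "parquet", "orc", "hive")

val tableType =
  if (storage.locationUri.isDefined) {
    CatalogTableType.EXTERNAL
  } else if (managedCapableSources.contains(provider.toLowerCase)) {
    CatalogTableType.MANAGED
  } else {
    CatalogTableType.EXTERNAL
  }
{code}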
P.S. InMemoryCatalog's doDropTable() deletes the table's directory, which in my
view should only be required for filesystem-based data sources:
{code:scala}
if (tableMeta.tableType == CatalogTableType.MANAGED) {
  ...
  // Delete the data/directory of the table
  val dir = new Path(tableMeta.location)
  try {
    val fs = dir.getFileSystem(hadoopConfig)
    fs.delete(dir, true)
  } catch {
    case e: IOException =>
      throw new SparkException(s"Unable to drop table $table as failed " +
        s"to delete its directory $dir", e)
  }
}
{code}
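Alternatively, if the MANAGED classification is kept as-is, the deletion itself
could be guarded by a provider check. A hedged sketch; isFileBasedProvider is a
hypothetical helper, not an existing Spark method:
{code:scala}
// Sketch: only touch the filesystem for filesystem-based providers.
def isFileBasedProvider(provider: Option[String]): Boolean =
  provider.exists(p => Set("text", "csv", "json", "parquet", "orc").contains(p.toLowerCase))

if (tableMeta.tableType == CatalogTableType.MANAGED && isFileBasedProvider(tableMeta.provider)) {
  // delete the table directory as before
}
{code}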