[jira] [Commented] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided

Jacek Laskowski (JIRA) Tue, 16 Jan 2018 02:14:37 -0800

    [ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326959#comment-16326959 ]


Jacek Laskowski commented on SPARK-22457:
-----------------------------------------

That should be fairly easy to fix _iff_ we want to restrict the formats to 
{{FileFormat}} (which the mentioned formats are subtypes of).

Care to submit a pull request that limits the places where {{path}} is used 
to {{FileFormat}}s only? (That would help draw more attention to the issue.)
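
A minimal sketch of that restriction, to make the idea concrete. It does not use Spark's classes: {{Provider}}, {{FileBased}}, {{NonFileBased}}, {{Storage}}, {{Table}} and {{PathOptionSketch}} are hypothetical stand-ins, and the real change would have to decide what counts as file-based (for example by checking for a {{FileFormat}}).

```scala
// Hypothetical stand-ins for Spark's catalog types; NOT the real Spark API.
sealed trait Provider
case object FileBased extends Provider     // assumed: parquet, orc, csv, json, text
case object NonFileBased extends Provider  // assumed: e.g. an elasticsearch connector

final case class Storage(locationUri: Option[String], properties: Map[String, String])
final case class Table(provider: Provider, storage: Storage)

object PathOptionSketch {
  // Derive the "path" option from the location only for file-based providers,
  // so other data sources never see a path they did not ask for.
  def dataSourceOptions(table: Table): Map[String, String] = {
    val pathOption = table.provider match {
      case FileBased    => table.storage.locationUri.map("path" -> _)
      case NonFileBased => None
    }
    table.storage.properties ++ pathOption
  }
}
```

With this shape, a non-file-based table keeps only the options the user supplied, even if the catalog has filled in a default location for it.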

> Tables are supposed to be MANAGED only taking into account whether a path is 
> provided
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-22457
>                 URL: https://issues.apache.org/jira/browse/SPARK-22457
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: David Arroyo
>            Priority: Major
>
> As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
> taking into account whether a path is provided:
> {code:java}
> val tableType = if (storage.locationUri.isDefined) {
>       CatalogTableType.EXTERNAL
>     } else {
>       CatalogTableType.MANAGED
>     }
> {code}
> This approach seems right for filesystem-based data sources. With other 
> data sources such as elasticsearch, however, it leads to the odd behaviour 
> described below: 
> 1) InMemoryCatalog's doCreateTable() adds a locationURI if 
> CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.
> 2) Before loading the data source table FindDataSourceTable's 
> readDataSourceTable() adds a path option if locationURI exists:
> {code:java}
> val pathOption = table.storage.locationUri.map("path" -> 
> CatalogUtils.URIToString(_))
> {code}
> 3) That causes an error when reading from elasticsearch because 'path' is an 
> option already supported by elasticsearch (locationUri is set to 
> file:/home/user/spark-rv/elasticsearch/shop/clients)
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find 
> mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is 
> required before using Spark SQL
> Would it be possible to mark tables as MANAGED only for a subset of data 
> sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or is there another 
> solution?
> P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, 
> which from my point of view should only be required for filesystem-based 
> data sources: 
> {code:java}
>        if (tableMeta.tableType == CatalogTableType.MANAGED)
>        ...
>        // Delete the data/directory of the table
>         val dir = new Path(tableMeta.location)
>         try {
>           val fs = dir.getFileSystem(hadoopConfig)
>           fs.delete(dir, true)
>         } catch {
>           case e: IOException =>
>             throw new SparkException(s"Unable to drop table $table as failed " +
>               s"to delete its directory $dir", e)
>         }
> {code}
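
The chain described in steps 1) and 2) of the report can be traced with a small self-contained model. This is only a sketch under stated assumptions: {{TableType}}, {{CatalogStorage}}, {{CatalogTableSketch}} and {{ManagedPathTrace}} are hypothetical simplifications, not Spark's classes, and the default location is assumed to be a warehouse directory plus the table name.

```scala
// Simplified, hypothetical model of the behaviour in the report; NOT Spark's code.
object TableType extends Enumeration { val MANAGED, EXTERNAL = Value }

final case class CatalogStorage(locationUri: Option[String], properties: Map[String, String])
final case class CatalogTableSketch(name: String, tableType: TableType.Value, storage: CatalogStorage)

object ManagedPathTrace {
  // The table type depends only on whether a location was provided.
  def tableTypeFor(storage: CatalogStorage): TableType.Value =
    if (storage.locationUri.isDefined) TableType.EXTERNAL else TableType.MANAGED

  // Step 1: a MANAGED table without a location is given a default one
  // (assumed here to be <warehouse>/<table name>).
  def createTable(t: CatalogTableSketch, warehouse: String): CatalogTableSketch =
    if (t.tableType == TableType.MANAGED && t.storage.locationUri.isEmpty)
      t.copy(storage = t.storage.copy(locationUri = Some(s"$warehouse/${t.name}")))
    else t

  // Step 2: any location is turned into a "path" option, whatever the data source.
  def readOptions(t: CatalogTableSketch): Map[String, String] =
    t.storage.properties ++ t.storage.locationUri.map("path" -> _)
}
```

A table created without any location therefore still reaches the data source with a {{path}} option, which is what a non-filesystem source such as the one in step 3) then misreads.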



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
