GitHub user xwu0226 opened a pull request:

    https://github.com/apache/spark/pull/13120

    [SPARK-15269][SQL] Set provided path to CatalogTable.storage.locationURI when creating an external non-Hive-compatible table

    ## What changes were proposed in this pull request?
    ### Symptom
    ```
    scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
    Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

    scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')")
    16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
    going through newSparkSQLSpecificMetastoreTable()
    res1: org.apache.spark.sql.DataFrame = []

    scala> spark.sql("drop table spark_15269")
    res2: org.apache.spark.sql.DataFrame = []

    scala> spark.sql("create table spark_15269 using json as select 1 as a")
    org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.;
      at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    ...
    ```
    The second CREATE TABLE fails, complaining that the path already exists.
    
    ### Root cause:
    When the first table is created as an external table with a data source path using the `json` format, `createDataSourceTables` considers it a non-Hive-compatible table because `json` is not a Hive SerDe. `newSparkSQLSpecificMetastoreTable` is then invoked to build the `CatalogTable` before asking HiveClient to create the metastore table. In this call, `locationURI` is not set, so when we convert the `CatalogTable` to a Hive table before passing it to the Hive metastore, the Hive table's data location is empty. The Hive metastore then implicitly creates a data location of `<hive warehouse>/tableName`, which is `file:/user/hive/warehouse/spark_15269` in the case above.
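
    To make the fallback concrete, here is a minimal, self-contained sketch; `StorageFormat` and `effectiveLocation` are hypothetical stand-ins for the metastore behavior, not Spark's actual internals:

    ```
    // Hypothetical simplification: when no location is recorded on the
    // table, the metastore derives one from the warehouse directory and
    // the table name.
    case class StorageFormat(locationUri: Option[String])

    def effectiveLocation(storage: StorageFormat, warehouseDir: String, tableName: String): String =
      storage.locationUri.getOrElse(s"$warehouseDir/$tableName")

    // locationURI is unset for the non-Hive-compatible external table,
    // so the fallback kicks in:
    effectiveLocation(StorageFormat(None), "file:/user/hive/warehouse", "spark_15269")
    // => file:/user/hive/warehouse/spark_15269
    ```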
    
    When the table is dropped, Hive does not delete this implicitly created path, because the table is external.
    
    When we create the second table with a SELECT and without a path, the table is created as a managed table and given a default path in its options, as follows:
    ```
    val optionsWithPath =
      if (!new CaseInsensitiveMap(options).contains("path")) {
        isExternal = false
        options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
      } else {
        options
      }
    ```
    This default path happens to be Hive's warehouse directory plus the table name, which is the same path the Hive metastore implicitly created earlier for the first table. So when `InsertIntoHadoopFsRelation` tries to write the provided data to this data source table, it complains that the path already exists, since the SaveMode is `SaveMode.ErrorIfExists`.
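
    The failure can be reduced to the following sketch; `defaultTablePath` and `writeErrorIfExists` here are simplified stand-ins for `SessionCatalog.defaultTablePath` and the `SaveMode.ErrorIfExists` check in `InsertIntoHadoopFsRelation`:

    ```
    import java.io.File

    // Simplified stand-in: a managed table's default location is
    // <warehouse>/<tableName>.
    def defaultTablePath(warehouseDir: String, tableName: String): String =
      s"$warehouseDir/$tableName"

    // Simplified stand-in for the SaveMode.ErrorIfExists check: refuse
    // to write if the target directory already exists.
    def writeErrorIfExists(path: String)(write: File => Unit): Unit = {
      val dir = new File(path)
      if (dir.exists()) {
        throw new IllegalStateException(s"path file:$path already exists.")
      }
      dir.mkdirs()
      write(dir)
    }

    // The default path equals the directory the metastore left behind for
    // the first table, so the second CREATE TABLE ... AS SELECT fails here.
    writeErrorIfExists(defaultTablePath("/user/hive/warehouse", "spark_15269")) { _ => () }
    ```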
    
    ### Solution:
    When creating an external data source table that is not Hive compatible, set the provided path on `CatalogTable.storage.locationURI`, so that the Hive metastore does not implicitly create a data location for the table.
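
    A simplified sketch of the intended behavior (the `Storage` and `Table` case classes are hypothetical stand-ins for `CatalogStorageFormat` and `CatalogTable`):

    ```
    // Hypothetical stand-ins for the catalog classes.
    case class Storage(locationUri: Option[String])
    case class Table(name: String, isExternal: Boolean, storage: Storage)

    // For an external (non-Hive-compatible) data source table, copy the
    // user-provided "path" option into the catalog entry's location, so
    // the Hive metastore never invents <warehouse>/<tableName> for it.
    def withProvidedLocation(table: Table, options: Map[String, String]): Table =
      if (table.isExternal) table.copy(storage = Storage(options.get("path")))
      else table
    ```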
    
    ## How was this patch tested?
    A test case is added, and the regression tests were run.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xwu0226/spark SPARK-15269

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13120
    
----
commit 21d188321284a86176927445fd1703353e0add09
Author: xin Wu <[email protected]>
Date:   2016-05-08T07:06:36Z

    spark-15206 add testcases for distinct aggregate in having clause following 
up PR12974

commit e43d56ab260633d7c2af54a6960cec7eadff34c4
Author: xin Wu <[email protected]>
Date:   2016-05-08T07:09:44Z

    Revert "spark-15206 add testcases for distinct aggregate in having clause 
following up PR12974"
    
    This reverts commit 98a1f804d7343ba77731f9aa400c00f1a26c03fe.

commit f9f1f1f36f3759eecfb6070b2372462ee454b700
Author: xin Wu <[email protected]>
Date:   2016-05-13T00:39:45Z

    SPARK-15269: set locationUFI to the non-hive compatible metastore table

commit 58ad82db21f90b571d70371ff25c167ecda17720
Author: xin Wu <[email protected]>
Date:   2016-05-14T20:16:11Z

    SPARK-15269: only for external datasource table

----

