GitHub user xwu0226 opened a pull request:
https://github.com/apache/spark/pull/13120
[SPARK-15269][SQL] Set provided path to CatalogTable.storage.locationURI when creating external non-hive compatible table
## What changes were proposed in this pull request?
### Symptom
```
scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')")
16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
going through newSparkSQLSpecificMetastoreTable()
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("drop table spark_15269")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_15269 using json as select 1 as a")
org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  ...
```
The second creation of the table fails, complaining that the path already exists.
### Root cause
When the first table is created as an external table with a data source path and `json` as the provider, `createDataSourceTables` treats it as a non-Hive-compatible table because `json` is not a Hive SerDe. `newSparkSQLSpecificMetastoreTable` is then invoked to build the `CatalogTable` before asking `HiveClient` to create the metastore table. In this call, `locationURI` is not set, so when the `CatalogTable` is converted to a Hive table and passed to the Hive metastore, the table's data location is empty. The Hive metastore then implicitly creates a data location at `<hive warehouse>/tableName`, which is `file:/user/hive/warehouse/spark_15269` in the case above.
When the table is dropped, Hive does not delete this implicitly created path because the table is external.
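This defaulting behavior can be sketched with a simplified model. The classes below are hypothetical stand-ins for illustration, not Spark's actual `CatalogTable`/`CatalogStorageFormat` API:

```scala
// Simplified stand-ins (not Spark's real classes) for a metastore table
// whose storage may or may not carry an explicit location.
case class StorageFormat(locationUri: Option[String])
case class MetastoreTable(name: String, storage: StorageFormat)

val warehouseDir = "file:/user/hive/warehouse"

// Mimics the Hive metastore's behavior: when no location is supplied,
// it implicitly creates <warehouse>/<tableName>.
def effectiveLocation(t: MetastoreTable): String =
  t.storage.locationUri.getOrElse(s"$warehouseDir/${t.name}")

// locationURI left unset, as happens before this patch:
val unset = MetastoreTable("spark_15269", StorageFormat(None))
println(effectiveLocation(unset)) // file:/user/hive/warehouse/spark_15269

// locationURI populated with the user-provided path:
val set = MetastoreTable("spark_15269",
  StorageFormat(Some("file:/home/xwu0226/spark-test/data/spark-15269")))
println(effectiveLocation(set)) // file:/home/xwu0226/spark-test/data/spark-15269
```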
When we create the second table with a SELECT and without a path, the table is created as a managed table, and a default path is supplied in the options as follows:
```
val optionsWithPath =
  if (!new CaseInsensitiveMap(options).contains("path")) {
    isExternal = false
    options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
  } else {
    options
  }
```
This default path is the Hive warehouse directory plus the table name, which is exactly the path the Hive metastore implicitly created earlier for the first table. So when `InsertIntoHadoopFsRelation` tries to write the SELECT result into this data source table, it fails because the path already exists and the SaveMode is `SaveMode.ErrorIfExists`.
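The default-path collision can be reproduced with a self-contained sketch; the `CaseInsensitiveMap` and `defaultTablePath` below are simplified stand-ins for Spark's internals, not the real implementations:

```scala
// Simplified stand-in for Spark's CaseInsensitiveMap: the "path" option
// is looked up regardless of key casing.
class CaseInsensitiveMap(m: Map[String, String]) {
  private val lower = m.map { case (k, v) => (k.toLowerCase, v) }
  def contains(key: String): Boolean = lower.contains(key.toLowerCase)
}

// Stand-in for sessionState.catalog.defaultTablePath: warehouse + name.
def defaultTablePath(table: String): String =
  s"file:/user/hive/warehouse/$table"

// Mirrors the optionsWithPath logic: if no path option was provided,
// the table becomes managed and gets the default warehouse path.
def resolveOptions(table: String,
                   options: Map[String, String]): Map[String, String] =
  if (!new CaseInsensitiveMap(options).contains("path"))
    options + ("path" -> defaultTablePath(table))
  else
    options

// No path given: the default collides with the implicitly created
// location left behind by the first (external) table.
println(resolveOptions("spark_15269", Map.empty)("path"))
// A path given (any casing): the options pass through unchanged.
println(resolveOptions("spark_15269", Map("PATH" -> "/tmp/data")))
```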
### Solution
When creating an external data source table that is not Hive compatible, set the provided path on `CatalogTable.storage.locationURI`, so that the Hive metastore does not implicitly create a data location for the table.
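The proposed change can be sketched as follows. The types and the function name here are hypothetical simplifications for illustration, not the actual `newSparkSQLSpecificMetastoreTable` code:

```scala
// Hypothetical, simplified model of the fix: for an external,
// non-Hive-compatible data source table, copy the user-provided path
// into storage.locationUri so the metastore does not invent a default.
case class StorageFormat(locationUri: Option[String],
                         serdeProperties: Map[String, String])
case class CatalogTable(name: String, storage: StorageFormat)

def newSparkSqlSpecificTable(name: String,
                             options: Map[String, String],
                             isExternal: Boolean): CatalogTable = {
  // The fix: propagate the provided path, but only for external tables;
  // managed tables keep letting the catalog choose their location.
  val loc = if (isExternal) options.get("path") else None
  CatalogTable(name, StorageFormat(loc, options))
}

val external = newSparkSqlSpecificTable(
  "spark_15269",
  Map("path" -> "file:/home/xwu0226/spark-test/data/spark-15269"),
  isExternal = true)
println(external.storage.locationUri)

val managed = newSparkSqlSpecificTable("t_managed", Map.empty, isExternal = false)
println(managed.storage.locationUri) // None
```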
## How was this patch tested?
A test case is added, and the regression tests were run.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xwu0226/spark SPARK-15269
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13120.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13120
----
commit 21d188321284a86176927445fd1703353e0add09
Author: xin Wu <[email protected]>
Date: 2016-05-08T07:06:36Z
spark-15206 add testcases for distinct aggregate in having clause following
up PR12974
commit e43d56ab260633d7c2af54a6960cec7eadff34c4
Author: xin Wu <[email protected]>
Date: 2016-05-08T07:09:44Z
Revert "spark-15206 add testcases for distinct aggregate in having clause
following up PR12974"
This reverts commit 98a1f804d7343ba77731f9aa400c00f1a26c03fe.
commit f9f1f1f36f3759eecfb6070b2372462ee454b700
Author: xin Wu <[email protected]>
Date: 2016-05-13T00:39:45Z
SPARK-15269: set locationUFI to the non-hive compatible metastore table
commit 58ad82db21f90b571d70371ff25c167ecda17720
Author: xin Wu <[email protected]>
Date: 2016-05-14T20:16:11Z
SPARK-15269: only for external datasource table
----