GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/16326
[SPARK-18915] [SQL] Automatic Table Repair when Creating a Partitioned Data Source Table with a Specified Path
### What changes were proposed in this pull request?
In Spark 2.1 (where the default of `spark.sql.hive.manageFilesourcePartitions` is
`true`), if we create a partitioned data source table with a specified path,
querying it returns nothing. To get the data, we have to manually issue a DDL to
repair the table.
In Spark 2.0, such a table returns the data stored in the specified path without
repairing the table. In Spark 2.1, if we set
`spark.sql.hive.manageFilesourcePartitions` to `false`, the behavior is the same
as in Spark 2.0.
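For example, one way to fall back to the Spark 2.0 behavior is to disable file source partition management when building the session. A minimal sketch, assuming the config is applied at session construction:
```Scala
import org.apache.spark.sql.SparkSession

// Sketch: with partition management off, partitioned data source tables
// are resolved by listing files under the path, as in Spark 2.0.
val spark = SparkSession.builder()
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .enableHiveSupport()
  .getOrCreate()
```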
Below is the output of Spark 2.1.
```Scala
scala> spark.range(5).selectExpr("id as fieldOne", "id as partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")

scala> spark.sql("desc formatted test").show(50, false)
+----------------------------+----------------------------------------------------------------------+-------+
|col_name                    |data_type                                                             |comment|
+----------------------------+----------------------------------------------------------------------+-------+
...
|Location:                   |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test|       |
|Table Type:                 |MANAGED                                                               |       |
...
|Partition Provider:         |Catalog                                                               |       |
+----------------------------+----------------------------------------------------------------------+-------+

scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path 'file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test') partitioned by (partCol)")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
+--------+-------+
```
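Without this PR, the manual workaround in Spark 2.1 is to repair the table before querying it, for example with the recover-partitions DDL:
```Scala
// Recover the partitions already present under the table's path; after
// this, spark.table("newTab").show() returns the rows written above.
scala> spark.sql("MSCK REPAIR TABLE newTab")
```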
This PR makes the behavior consistent with Spark 2.0, no matter whether
`spark.sql.hive.manageFilesourcePartitions` is `true` or `false`: it repairs the
table when creating such a table. After the change, the behavior also becomes
consistent with what we do for CTAS of partitioned data source tables.
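After the change, the example above returns the data without any manual repair. Expected output, sketched from the `spark.range(5)` write (row order may differ):
```Scala
scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|       0|      0|
|       1|      1|
|       2|      2|
|       3|      3|
|       4|      4|
+--------+-------+
```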
### How was this patch tested?
Modified the existing test case.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark testtt
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16326.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16326
----
commit 40abcc281344923b33886243c83358f5084c2489
Author: gatorsmile <[email protected]>
Date: 2016-12-18T03:55:32Z
fix.
commit 3942c4ea53b199a476855a3f39087d893a4e900a
Author: gatorsmile <[email protected]>
Date: 2016-12-18T03:56:51Z
fix.
----