GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19579
[SPARK-22356][SQL] data source table should support overlapped columns
between data and partition schema
## What changes were proposed in this pull request?
This is a regression introduced by #14207. Since Spark 2.1, we store the
inferred schema when creating the table, to avoid inferring the schema again at
read path. However, there is one special case: overlapped columns between the data
and partition schema. This case breaks the assumption about the table schema that
there is no overlap between the data and partition schema, and that partition columns
should be at the end. As a result, in Spark 2.1 the table scan has an
incorrect schema that puts the partition columns at the end. In Spark 2.2, we added
a check in CatalogTable to validate the table schema, which fails in this case.
To fix this issue, a simple and safe approach is to fall back to the old
behavior when overlapped columns are detected, i.e. store an empty schema in the metastore.
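The invariant and the fallback described above can be illustrated without Spark. The sketch below uses plain Python with column names as strings (hypothetical helper names, not Spark's actual `CatalogTable` implementation): a table schema is expected to be the data columns followed by the partition columns with no column in both, and when that invariant is violated the fix stores an empty schema so it is re-inferred at read time.

```python
# Minimal sketch of the table-schema invariant, assuming simplified
# string column names instead of Spark's StructType. Helper names are
# illustrative, not Spark APIs.

def validate_table_schema(data_cols, partition_cols):
    """Mimics the check added in Spark 2.2: the full table schema must be
    data columns followed by partition columns, with no overlap."""
    overlap = set(data_cols) & set(partition_cols)
    if overlap:
        raise ValueError(
            f"overlapped columns between data and partition schema: {sorted(overlap)}")
    return data_cols + partition_cols

def schema_to_store(data_cols, partition_cols):
    """The fix in this PR, sketched: when an overlap is detected, fall back
    to storing an empty schema in the metastore so the schema is inferred
    again at read time instead of failing the validation check."""
    if set(data_cols) & set(partition_cols):
        return []  # empty schema -> old behavior, re-infer on read
    return data_cols + partition_cols

# No overlap: schema is stored as-is, partition columns at the end.
print(schema_to_store(["a", "b"], ["p"]))  # ['a', 'b', 'p']
# Overlapped column "p": store an empty schema instead.
print(schema_to_store(["a", "p"], ["p"]))  # []
```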
## How was this patch tested?
A new regression test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark bug2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19579.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19579
----
commit 18907cb2359efb9b4e874482916de04af9cf90a2
Author: Wenchen Fan <[email protected]>
Date: 2017-10-26T01:26:39Z
overlapped columns between data and partition schema in data source tables
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]