GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19579

    [SPARK-22356][SQL] data source table should support overlapped columns 
between data and partition schema

    ## What changes were proposed in this pull request?
    
This is a regression introduced by #14207. Since Spark 2.1, we store the 
inferred schema when creating the table, to avoid inferring the schema again at 
the read path. However, there is one special case: overlapping columns between 
the data and partition schema. This case breaks the assumption of the table 
schema that there is no overlap between the data and partition schema, and that 
partition columns appear at the end. As a result, in Spark 2.1 the table scan 
has an incorrect schema that puts partition columns at the end. In Spark 2.2, we 
added a check in CatalogTable to validate the table schema, which fails in this case.
    
    To fix this issue, a simple and safe approach is to fall back to the old 
behavior when overlapping columns are detected, i.e. store an empty schema in the metastore.
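    As a rough illustration of the invariant being violated (a hypothetical 
sketch, not Spark's actual CatalogTable validation code; the function name and 
column representation here are made up):

```python
def validate_schema(data_columns, partition_columns):
    """Hypothetical sketch of the table-schema invariant described above:
    the full table schema is the data columns followed by the partition
    columns, with no overlap between the two."""
    overlap = set(data_columns) & set(partition_columns)
    if overlap:
        raise ValueError(f"overlapping columns: {sorted(overlap)}")
    # Partition columns must come at the end of the table schema.
    return data_columns + partition_columns

# Normal case: disjoint data and partition schemas.
print(validate_schema(["a", "b"], ["p"]))  # ['a', 'b', 'p']

# The regression case: the data files themselves contain column "p",
# so the inferred data schema overlaps with the partition schema and
# the validation fails.
try:
    validate_schema(["a", "p"], ["p"])
except ValueError as e:
    print(e)  # overlapping columns: ['p']
```

    The fix described above sidesteps this failure by storing an empty schema 
in the metastore for such tables, so the schema is inferred at read time as in 
pre-2.1 behavior.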
    
    ## How was this patch tested?
    
    New regression test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19579
    
----
commit 18907cb2359efb9b4e874482916de04af9cf90a2
Author: Wenchen Fan <[email protected]>
Date:   2017-10-26T01:26:39Z

    overlapped columns between data and partition schema in data source tables

----


---
