GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/15951

    [SPARK-18510] Fix data corruption from inferred partition column dataTypes

    ## What changes were proposed in this pull request?
    
    ### The Issue
    
    If I specify my schema when reading:
    ```scala
    spark.read
      .schema(someSchemaWherePartitionColumnsAreStrings)
    ```
    but partition inference resolves the partition column as IntegerType (or presumably LongType or DoubleType, i.e. fixed-size types), then once UnsafeRows are generated, the data will be corrupted.
    
    ### Proposed solution
    
    The partition handling code path is kind of a mess. My fix probably adds to that mess, but at least it tries to standardize the code path.
    
    The real issue is that a user who goes through the `spark.read` code path can never clearly specify what the partition columns are: if they specify the fields in `schema`, we practically ignore what they provide and fall back to our inferred data types. What happens in the end is data corruption.
    
    My solution fixes this by always inferring the partition columns the first time the table is specified. Once we know which columns are partition columns, we look them up in the user-specified schema and use the dataType provided there, falling back to the smallest common data type otherwise.
    
    We will ALWAYS append partition columns to the user's schema, even if they didn't ask for them. We only use the data type they provided if they explicitly specified the column. While this is confusing, it has been the behavior since Spark 1.6, and I didn't want to change it during the QA period of Spark 2.1. We may revisit this decision later.
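
    The reconciliation described above can be sketched roughly as follows (a simplified model with hypothetical names, not the actual Spark internals; `Field` stands in for `StructField` and types are plain strings):

    ```scala
    // Simplified stand-in for StructField (hypothetical).
    case class Field(name: String, dataType: String)

    // Partition columns are always appended to the schema; the user's
    // declared type wins when present, otherwise the inferred type is kept.
    def reconcileSchema(userSchema: Seq[Field], inferredPartitions: Seq[Field]): Seq[Field] = {
      val partitionNames = inferredPartitions.map(_.name).toSet
      // Data columns: everything the user declared that is not a partition column.
      val dataFields = userSchema.filterNot(f => partitionNames.contains(f.name))
      // Partition columns: prefer the user-declared type, else the inferred one.
      val partitionFields = inferredPartitions.map { inferred =>
        userSchema.find(_.name == inferred.name).getOrElse(inferred)
      }
      dataFields ++ partitionFields
    }
    ```

    For example, with a user schema of `[value: int, part: string]` and an inferred partition column `part: int`, the result keeps the user's string type for `part`; if the user omits `part` entirely, the inferred integer type is appended.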
    
    If this PR goes in, a side effect is that we won't need 
https://github.com/apache/spark/pull/15942.
    
    ## How was this patch tested?
    
    Regression tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark partition-corruption

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15951.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15951
    
----
commit ef2e1c26cbe20c3f7b2fd6a4d528b2b29d842203
Author: Burak Yavuz <[email protected]>
Date:   2016-11-20T22:19:33Z

    Fix data corruption from inferred partition column dataTypes

commit 9080c4ed50e9ddf65d3e045e0604398618256fb4
Author: Burak Yavuz <[email protected]>
Date:   2016-11-20T22:26:08Z

    make naming and comments a bit more descriptive

----

