GitHub user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/15951
> The real issue is that a user that uses the spark.read code path can never clearly specify what the partition columns are. If you try to specify the fields in schema, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.
For data source tables, the partition columns are part of the data schema, so users do not need to know which columns are used for partitioning. If they provide the right types, they should be able to see the expected data.
In the test cases, we can get the correct result with the following changes:
```scala
spark.range(4)
  .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
  .coalesce(1)

val schema = new StructType()
  .add("part", LongType)
  .add("ex", ArrayType(StringType))
  .add("id", LongType)

spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
```
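
For context, the read above presumes the first DataFrame was already written out partitioned by `part` under `src`. A minimal sketch of that write side, assuming `src` is a temporary directory and approximating `createArray` with a simple UDF (both are stand-ins for the test fixtures, not the exact test code):

```scala
import java.nio.file.Files
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Stand-in fixtures: a temp output directory and a UDF that builds a
// string array of the given length (approximating the test's createArray).
val src = Files.createTempDirectory("partitioned-src").toFile
val createArray = udf { (n: Long) => (1L to n).map(_.toString) }

// Write the data partitioned by `part`, so `part` is encoded in the
// directory layout rather than inside the parquet files themselves.
spark.range(4)
  .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
  .coalesce(1)
  .write
  .partitionBy("part")
  .parquet(src.toString)
```

Reading it back with the user-supplied schema above should then surface `part` as LongType rather than whatever type partition discovery would otherwise infer.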