[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

brkyvz Mon, 21 Nov 2016 00:23:43 -0800

Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/15951
  
    True. But there's no reason "part" and "id" can't be strings, right?
    
    On Nov 21, 2016 12:16 AM, "Xiao Li" <[email protected]> wrote:
    
    > The real issue is that a user that uses the spark.read code path can never
    > clearly specify what the partition columns are. If you try to specify the
    > fields in schema, we practically ignore what the user provides, and fall
    > back to our inferred data types. What happens in the end is data corruption.
    >
    > For data source tables, the partition columns are part of the data schema.
    > Users do not need to know which columns are used for partitioning. If they
    > can provide the right types, they should be able to see the expected data.
    >
    > In the test cases, we can get the correct result with the following
    > changes:
    >
    >       spark.range(4).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part).coalesce(1)
    >       val schema = new StructType()
    >         .add("part", LongType)
    >         .add("ex", ArrayType(StringType))
    >         .add("id", LongType)
    >       spark.read
    >         .schema(schema)
    >         .format("parquet")
    >         .load(src.toString).show()
    >
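
To make the point about declared versus inferred partition types concrete, here is a minimal, self-contained sketch in plain Scala. It is not Spark's implementation: the names `PartitionInferenceSketch`, `inferType`, `parsePartitions`, and `resolveType` are hypothetical, and the inference rule (digits only means long) is a deliberate simplification. It only illustrates the behavior argued for in this thread, where a type declared by the user takes precedence and inference is used only as a fallback.

```scala
// Hypothetical sketch: NOT Spark's actual partition-discovery code.
object PartitionInferenceSketch {
  sealed trait ColType
  case object LongT extends ColType
  case object StringT extends ColType

  // Simplified inference: a non-empty, all-digit value is treated as a long,
  // anything else as a string.
  def inferType(raw: String): ColType =
    if (raw.nonEmpty && raw.forall(_.isDigit)) LongT else StringT

  // Collect "name=value" segments from a partition directory path;
  // segments without '=' (such as file names) are skipped.
  def parsePartitions(path: String): Map[String, String] =
    path.split("/").iterator
      .map(_.split("=", 2))
      .collect { case Array(k, v) => k -> v }
      .toMap

  // Prefer the user-declared type; fall back to inference only when the
  // user did not declare the column.
  def resolveType(name: String, raw: String,
                  declared: Map[String, ColType]): ColType =
    declared.getOrElse(name, inferType(raw))
}
```

With a path segment such as `part=0`, inference alone yields a long, while a user who declared "part" as a string gets a string back from `resolveType`.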



