[
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650722#comment-16650722
]
Daniel Darabos commented on SPARK-20144:
----------------------------------------
Yeah, I'm not too happy about the alphabetical ordering either. I thought I
could simply not sort, and get the "original" order. But at the point where I
made my change, the files are already in a jumbled order. Maybe it's the file
system listing order, which could be anything.

99% of the time I'm just reading back a single partitioned Parquet file. In
this case the alphabetical ordering is the right ordering ({{part-00001}},
{{part-00002}}, ...). The rows of the resulting DataFrame will be in the same
order as they were originally written. So I think this issue is satisfied by
the change. (The test also demonstrates this.)
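
To illustrate why alphabetical order is enough in this case: zero-padded part file names sort lexicographically in the same order as their partition numbers. A minimal sketch in plain Python, with no Spark involved. The file names and the {{partition_number}} helper are made up for illustration and only mimic the {{part-NNNNN}} pattern.

```python
# Hypothetical part-file names, in a jumbled listing order.
listing = ["part-00010", "part-00002", "part-00001", "part-00100"]


def partition_number(name):
    # Numeric partition index encoded after the dash (illustrative helper).
    return int(name.split("-")[1])


# Plain string sort, as an alphabetical ordering would do.
alphabetical = sorted(listing)

# Sort by the actual partition number.
numeric = sorted(listing, key=partition_number)

# Thanks to the zero padding, both orders agree.
print(alphabetical == numeric)
```

This only holds while the numbers are padded to the same width; unpadded names would not sort this way.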

The 1% case (for me) is when I'm reading back multiple Parquet files with a
glob in a single {{spark.read.parquet("dir-\{0,5,10}")}} call. In this case it
would be nice to respect the order given by the user ({{dir-0}}, {{dir-5}},
{{dir-10}}). My PR messes this up, because sorting alphabetically gives
{{dir-0}}, {{dir-10}}, {{dir-5}}. But at least the partitions within each
Parquet file will be contiguous. That's still an improvement.
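
A rough sketch of the difference between the two orders, in plain Python with no Spark involved. Only the directory names come from the example above; the part file paths are invented for illustration, and sorting full path strings is an assumption about how the alphabetical ordering behaves.

```python
# Order the user asked for in the glob dir-{0,5,10}.
requested = ["dir-0", "dir-5", "dir-10"]

# Hypothetical part files, two per directory, listed in the requested order.
files = [f"{d}/part-{i:05d}" for d in requested for i in (0, 1)]

# Plain alphabetical sort of the full paths.
alphabetical = sorted(files)

# Directory order after sorting, with duplicates removed (first occurrence kept).
dir_order = list(dict.fromkeys(p.split("/")[0] for p in alphabetical))

print(dir_order)
for path in alphabetical:
    print(path)
```

The directories come out in a different order than requested, but paths sharing a directory prefix stay next to each other, so each directory's part files remain contiguous.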
> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Li Jin
> Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting
> dataframe is not the same as the ordering of rows in the dataframe that the
> parquet file was produced from.
> This is because FileSourceStrategy.scala combines the parquet files into
> fewer partitions and also reorders them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with
> 2.1.