[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

Daniel Darabos (JIRA) Mon, 15 Oct 2018 13:13:36 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650722#comment-16650722
 ]


Daniel Darabos commented on SPARK-20144:
----------------------------------------

Yeah, I'm not too happy about the alphabetical ordering either. I thought I 
could simply not sort, and get the "original" order. But at the point where I 
made my change, the files are already in a jumbled order. Maybe it's the file 
system listing order, which could be anything.

99% of the time I'm just reading back a single partitioned Parquet file. In
this case the alphabetical ordering is the right ordering. ({{part-00001}},
{{part-00002}}, ...) The rows of the resulting DataFrame will be in the same
order as they were in when the file was written. So I think this issue is
satisfied by the change. (The test also demonstrates this.)
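
To illustrate why alphabetical sorting is enough in the single-file case, here is a minimal Python sketch. It is not the code from the PR. The file names are hypothetical (real part files carry longer suffixes), and the shuffle only stands in for an arbitrary file system listing order.

```python
import random

# Hypothetical part-file names, zero-padded like the part-00001 example above.
names = [f"part-{i:05d}" for i in range(12)]

# Stand-in for a file system listing that returns files in arbitrary order.
listed = names[:]
random.Random(0).shuffle(listed)

# Sorting by name restores the write order, because zero-padded numbers of
# equal width compare the same way as strings and as integers.
restored = sorted(listed)
```

This only holds while the numeric part is zero-padded to a fixed width, which is the assumption behind treating alphabetical order as the right order here.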

The 1% case (for me) is when I'm reading back multiple Parquet files with a
glob in a single {{spark.read.parquet("dir-\{0,5,10}")}} call. In this case it
would be nice to respect the order given by the user ({{dir-0}}, {{dir-5}},
{{dir-10}}). My PR messes this up, because sorting alphabetically gives
({{dir-0}}, {{dir-10}}, {{dir-5}}). But at least the partitions within each
Parquet file will be contiguous. That's still an improvement.
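
The multi-directory case can be sketched the same way. Again, this is only an illustration in plain Python, not the PR's code, and the directory contents are made up: each hypothetical directory holds two part files.

```python
from itertools import groupby

# Hypothetical directories, in the order the user asked for them.
requested = ["dir-0", "dir-5", "dir-10"]
files = [f"{d}/part-{i:05d}" for d in requested for i in range(2)]

# String sorting compares character by character, so "dir-10" sorts
# before "dir-5" ('1' < '5'), which differs from the requested order.
by_name = sorted(files)

# Collapse consecutive files from the same directory into one entry each.
# Getting exactly one entry per directory means each directory's files
# stay contiguous after sorting.
dir_order = [d for d, _ in groupby(by_name, key=lambda p: p.split("/")[0])]
```

Here `dir_order` comes out as `dir-0`, `dir-10`, `dir-5`: not the order the user gave, but each directory's part files stay together and in their own order.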

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> when we read parquet files in 2.0.2, the ordering of rows in the resulting
> dataframe is not the same as the ordering of rows in the dataframe that the
> parquet file was produced with.
> This is because FileSourceStrategy.scala combines the parquet files into
> fewer partitions and also reorders them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
