[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

Daniel Darabos (JIRA) Mon, 08 Oct 2018 13:02:25 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642401#comment-16642401
 ]


Daniel Darabos commented on SPARK-20144:
----------------------------------------

Sorry, I had an idea for a quick fix for this and sent a pull request without 
discussing it first. Let me copy the rationale from the PR:

I'm adding {{spark.sql.files.allowReordering}}, defaulting to {{true}}. When 
set to {{true}}, the existing behavior is unchanged. When set to {{false}}, the 
input files are read in alphabetical order. This means partitions are read in 
the {{part-00001}}, {{part-00002}}, {{part-00003}}... order, recovering the same 
ordering as before.
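
As a rough illustration of why reading files in alphabetical order is enough, 
here is a minimal sketch in plain Python (not Spark). The file names, the row 
values, and the largest-file-first ordering are assumptions made up for the 
example. They are not taken from the PR, and the size-based ordering only 
stands in for whatever reordering the file scan performs.

```python
# Hypothetical stand-in for a sorted dataset written out as four part files.
# Each file holds a contiguous, already-sorted slice of the rows.
rows = list(range(100))
part_files = {
    "part-00000": rows[0:40],
    "part-00001": rows[40:45],
    "part-00002": rows[45:90],
    "part-00003": rows[90:100],
}


def read_in_order(files, names):
    """Concatenate the rows of the given files in the given file-name order."""
    return [row for name in names for row in files[name]]


# Reordering allowed: files are taken largest first. This is an assumed
# stand-in for a size-based reordering, used here only for illustration.
by_size = sorted(part_files, key=lambda name: len(part_files[name]), reverse=True)
reordered = read_in_order(part_files, by_size)

# Reordering disabled: files are taken in alphabetical order of their names.
alphabetical = sorted(part_files)
restored = read_in_order(part_files, alphabetical)

print(reordered == rows)  # False: the written order is lost
print(restored == rows)   # True: the written order is recovered
```

The alphabetical order matches the numeric order only because the part 
numbers are zero-padded. Without padding, {{part-10}} would sort before 
{{part-2}}.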

While *SPARK-20144* has been closed as "Not A Problem", I think this is still a 
valuable feature. Spark has been 
[touted|https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html]
 as the best tool for sorting. It certainly can sort data. But without this 
change, it cannot read sorted data back in sorted order through the DataFrame 
API.

My practical use case is that we allow users to run their SQL expressions 
through our UI. We also allow them to ask for the results to be persisted to 
Parquet files. We noticed that if they do an {{ORDER BY}}, the ordering is lost 
if they also ask for persistence. For example, they might want to rank data 
points by a score, so they can later get the top 10 or top 10,000,000 entries 
easily. With this change we could fulfill this use case.

The fix is small and safe: 25 lines including the test and docs, and it changes 
behavior only when the new flag is set. Is there a reason not to do this?

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



