[
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642401#comment-16642401
]
Daniel Darabos commented on SPARK-20144:
----------------------------------------
Sorry, I had an idea for a quick fix for this and sent a pull request without
discussing it first. Let me copy the rationale from the PR:
I'm adding {{spark.sql.files.allowReordering}}, defaulting to {{true}}. When
set to {{true}}, the behavior is unchanged. When set to {{false}}, the input
files are read in alphabetical order. This means partitions are read in the
{{part-00001}}, {{part-00002}}, {{part-00003}}... order, so the rows come back
in the order in which they were written.
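
To illustrate the difference between the two settings, here is a minimal
Python sketch. It is not Spark code: {{plan_read_order}} and the file sizes are
hypothetical, and the sketch assumes that the reordering sorts files by
descending size, which is not stated in this thread.

```python
from typing import List, Tuple

# (file name, size in bytes). Names and sizes are made up for illustration.
PartFile = Tuple[str, int]


def plan_read_order(files: List[PartFile], allow_reordering: bool = True) -> List[str]:
    """Return the order in which a toy reader would visit the files.

    Hypothetical helper: it only mimics the idea behind the proposed flag.
    """
    if allow_reordering:
        # Assumption: reordering means "largest file first", a common
        # heuristic when packing files into balanced partitions.
        ordered = sorted(files, key=lambda f: f[1], reverse=True)
    else:
        # Alphabetical by name. Zero-padded part numbers sort in write order.
        ordered = sorted(files, key=lambda f: f[0])
    return [name for name, _ in ordered]


files = [
    ("part-00002", 300),
    ("part-00001", 100),
    ("part-00003", 200),
]

print(plan_read_order(files, allow_reordering=True))
print(plan_read_order(files, allow_reordering=False))
```

With reordering allowed, the visiting order depends on file sizes; with it
disabled, the order depends only on the file names.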
While *SPARK-20144* has been closed as "Not A Problem", I think this is still a
valuable feature. Spark has been
[touted|https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html]
as the best tool for sorting. It certainly can sort data, but without this
change it cannot read sorted data back in order through the DataFrame API.
My practical use case is that we allow users to run their SQL expressions
through our UI. We also allow them to ask for the results to be persisted to
Parquet files. We noticed that if they do an {{ORDER BY}}, the ordering is lost
if they also ask for persistence. For example they might want to rank data
points by a score, so they can later get the top 10 or top 10,000,000 entries
easily. With this change we could fulfill this use case.
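
The ranking use case can be simulated without Spark. The sketch below is a toy
model under stated assumptions: the JSON-lines part files, the {{score}} field,
and all helper names are hypothetical. It writes rows sorted by score into
numbered part files, reads them back in file-name order, and takes the first N
rows as the top N.

```python
import json
import os
import tempfile


def write_sorted_parts(rows, directory, rows_per_file):
    """Sort rows by descending score and split them across numbered part files."""
    ordered = sorted(rows, key=lambda r: r["score"], reverse=True)
    for start in range(0, len(ordered), rows_per_file):
        # Zero-padded names so alphabetical order equals write order.
        name = "part-%05d.json" % (start // rows_per_file + 1)
        with open(os.path.join(directory, name), "w") as f:
            for row in ordered[start:start + rows_per_file]:
                f.write(json.dumps(row) + "\n")


def read_in_name_order(directory):
    """Read every part file, visiting files in alphabetical order."""
    result = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            result.extend(json.loads(line) for line in f)
    return result


def top_n(directory, n):
    """The top N rows are simply the first N rows when file order is kept."""
    return read_in_name_order(directory)[:n]


# Made-up data: 20 rows with distinct scores.
rows = [{"id": i, "score": (i * 37) % 101} for i in range(20)]

with tempfile.TemporaryDirectory() as d:
    write_sorted_parts(rows, d, rows_per_file=6)
    top3 = top_n(d, 3)

print(top3)
```

If the files were visited in any other order, the first N rows would no longer
be the N highest-scoring ones, and a full re-sort would be needed.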
The fix is small and safe (25 lines including test and docs, and it only
changes behavior when the new flag is set). Is there a reason not to do this?
> spark.read.parquet no longer maintains ordering of the data
> -----------------------------------------------------------
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Li Jin
> Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> that when we read Parquet files in 2.0.2, the ordering of rows in the
> resulting dataframe is not the same as the ordering of rows in the dataframe
> that the Parquet file was produced with.
> This is because FileSourceStrategy.scala combines the Parquet files into
> fewer partitions and also reorders them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with
> 2.1.