[
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951073#comment-15951073
]
Li Jin commented on SPARK-20144:
--------------------------------
I totally agree correctness takes precedence. If sorting is the only way, we
will do that, but I think there is a way we can maintain ordering with the
parquet format.
Parquet itself doesn't change the ordering: data is stored as
parquet_file_0, parquet_file_1, ..., and rows are ordered within each file.
However, it is FileSourceStrategy
(https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L168)
that re-sorts the parquet files and ends up changing the ordering.
If the expected semantics of Parquet don't guarantee ordering, I won't complain
about the behavior of spark.read.parquet, but it seems it's Catalyst that is
changing the ordering here.
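To make the mechanism concrete, here is a minimal sketch (plain Python, not the actual Scala implementation) of the kind of bin-packing FileSourceStrategy performs: it sorts the input files by size, largest first, before packing them into read partitions, so the partition order no longer follows the on-disk file order. The function name, the size-descending sort, and the `max_partition_bytes` parameter are illustrative assumptions based on my reading of the linked code, not Spark's exact behavior.

```python
# Illustrative sketch (not Spark's actual code): pack files into partitions
# after sorting by size descending, as the bin-packing in FileSourceStrategy
# appears to do. Note how the original file order is lost.

def pack_files(files, max_partition_bytes):
    """files: list of (name, size_in_bytes). Returns a list of partitions,
    each partition being a list of file names."""
    # Assumption: files are sorted largest-first before packing.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)
    partitions, current, current_bytes = [], [], 0
    for name, size in ordered:
        # Start a new partition once the current one would overflow.
        if current and current_bytes + size > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions

files = [("parquet_file_0", 10), ("parquet_file_1", 300), ("parquet_file_2", 50)]
print(pack_files(files, max_partition_bytes=128))
# file 1 comes first because it is largest; files 2 and 0 share a partition
```

Under this sketch, a dataframe written as file 0, file 1, file 2 is read back in the order 1, 2, 0, which is exactly the ordering break described below.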
> spark.read.parquet no longer maintains ordering of the data
> ---------------------------------------------------------
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting
> dataframe is not the same as the ordering of rows in the dataframe that the
> parquet file was produced from.
> This is because FileSourceStrategy.scala combines the parquet files into
> fewer partitions and also reorders them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec
> changed quite a bit from 2.0.2 to 2.1, so I'm not sure whether this is an
> issue in 2.1 as well.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]