GitHub user darabos opened a pull request:
https://github.com/apache/spark/pull/22673
[SPARK-20144] Allow reading files in order with
spark.sql.files.allowReordering=false
## What changes were proposed in this pull request?
I'm adding `spark.sql.files.allowReordering`, defaulting to `true`. When
set to `true`, the behavior is unchanged. When set to `false`, the input files
are read in alphabetical order. This means partitions are read in the
`part-00001`, `part-00002`, `part-00003`... order, recovering the order in
which the data was written.
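To illustrate the difference, here is a minimal sketch in plain Python (it is not Spark code and does not call any Spark API). It assumes the existing reordering amounts to reading the largest files first; that is only a simplified model of the file-scan planning. The file names, sizes, rows, and the `read_rows` helper are all made up for the example.

```python
# Hypothetical part files: (name, size in bytes, rows).
# The rows were written sorted by score, descending, across the files.
part_files = [
    ("part-00001", 120, [99, 97, 95]),
    ("part-00002", 300, [90, 88, 85, 80]),
    ("part-00003", 200, [79, 70]),
]


def read_rows(files, allow_reordering):
    """Concatenate rows in the order the files would be read."""
    if allow_reordering:
        # Assumed model of the existing behavior: largest files first.
        ordered = sorted(files, key=lambda f: f[1], reverse=True)
    else:
        # Proposed behavior: alphabetical order by file name.
        ordered = sorted(files, key=lambda f: f[0])
    return [row for _, _, rows in ordered for row in rows]


reordered = read_rows(part_files, allow_reordering=True)
in_order = read_rows(part_files, allow_reordering=False)

print(reordered[:3])  # [90, 88, 85]
print(in_order[:3])   # [99, 97, 95]
```

In this toy model, only the alphabetical read returns the rows in the order they were written, so taking the first few rows gives the top scores without sorting again.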
While [*SPARK-20144*](https://issues.apache.org/jira/browse/SPARK-20144)
has been closed as "Not A Problem", I think this is still a valuable feature.
Spark has been
[touted](https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html)
as the best tool for sorting. It certainly can sort data. But without this
change, it cannot read sorted data back in order through the DataFrame API.
My practical use case is that we allow users to run their SQL expressions
through our UI. We also allow them to ask for the results to be persisted to
Parquet files. We noticed that if they do an `ORDER BY`, the ordering is lost
if they also ask for persistence. For example they might want to rank data
points by a score, so they can later get the top 10 or top 10,000,000 entries
easily. With this change we could fulfill this use case.
Thanks for hearing me out! :blush:
## How was this patch tested?
New unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/darabos/spark darabos-spark-20144
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22673.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22673
----
commit 58cc4daa887482273487ea678d032aab2f5d36e0
Author: Daniel Darabos <darabos.daniel@...>
Date: 2018-10-08T15:31:23Z
Allow reading files in order with spark.sql.files.allowReordering=false.
----