GitHub user darabos opened a pull request:

    https://github.com/apache/spark/pull/22673

    [SPARK-20144] Allow reading files in order with spark.sql.files.allowReordering=false

    ## What changes were proposed in this pull request?
    
    I'm adding `spark.sql.files.allowReordering`, defaulting to `true`. When set to `true`, the behavior is unchanged. When set to `false`, the input files are read in alphabetical order by file name. This means partitions are read in the `part-00001`, `part-00002`, `part-00003`... order, recovering the ordering the data had when it was written.
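
To illustrate the effect of the flag without Spark, here is a minimal plain-Python sketch. It is a model, not the patch itself: `plan_read_order` and the `(path, size)` tuples are hypothetical names, and the size-descending order in the `True` branch is an assumption about how the existing planner arranges files before packing them into partitions.

```python
from typing import List, Tuple

# (path, size in bytes); a hypothetical stand-in for one entry of a file listing.
FileInfo = Tuple[str, int]


def plan_read_order(files: List[FileInfo], allow_reordering: bool = True) -> List[str]:
    """Return the file paths in the order a reader would visit them."""
    if allow_reordering:
        # Assumed existing behavior: largest files first, which helps
        # pack files into evenly sized read partitions.
        ordered = sorted(files, key=lambda f: f[1], reverse=True)
    else:
        # Proposed behavior: plain lexicographic order of the file names.
        ordered = sorted(files, key=lambda f: f[0])
    return [path for path, _ in ordered]


files = [("part-00002", 300), ("part-00001", 100), ("part-00003", 200)]
print(plan_read_order(files, allow_reordering=True))
print(plan_read_order(files, allow_reordering=False))
```

With the flag off, the result depends only on the file names, so the read order matches the `part-NNNNN` numbering regardless of file sizes.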
    
    While [*SPARK-20144*](https://issues.apache.org/jira/browse/SPARK-20144) has been closed as "Not A Problem", I think this is still a valuable feature. Spark has been [touted](https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html) as the best tool for sorting. It certainly can sort data. But without this change, it cannot read sorted data back in order through the DataFrame API.
    
    My practical use case is that we allow users to run their SQL expressions through our UI. We also allow them to ask for the results to be persisted to Parquet files. We noticed that if they do an `ORDER BY`, the ordering is lost if they also ask for persistence. For example, they might want to rank data points by a score, so they can later get the top 10 or top 10,000,000 entries easily. With this change we could fulfill this use case.
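
The top-N scenario can be modeled in a few lines. This is a sketch under stated assumptions, not Spark code: `parts` is a hypothetical stand-in for the persisted output of an `ORDER BY score DESC` query, where each part file holds a contiguous, already sorted slice of the result, and `read_rows` and `top_n` are illustrative helpers.

```python
from itertools import chain, islice
from typing import Dict, List

# Hypothetical persisted result: scores sorted descending, split across part files.
parts: Dict[str, List[int]] = {
    "part-00001": [99, 97, 95],
    "part-00002": [90, 85],
    "part-00003": [80, 70, 60, 50],
}


def read_rows(part_names: List[str]) -> List[int]:
    """Concatenate rows in the order the part files are visited."""
    return list(chain.from_iterable(parts[name] for name in part_names))


def top_n(part_names: List[str], n: int) -> List[int]:
    """Take the first n rows of the concatenated read, with no re-sorting."""
    return list(islice(read_rows(part_names), n))


# Visiting files by name preserves the written order.
in_name_order = sorted(parts)
# Visiting the files with the most rows first (a stand-in for size-based reordering) does not.
by_size = sorted(parts, key=lambda name: len(parts[name]), reverse=True)

print(top_n(in_name_order, 4))  # [99, 97, 95, 90]
print(top_n(by_size, 4))        # [80, 70, 60, 50]
```

Only the name-ordered read returns the highest scores first, which is what makes a cheap "first N rows" query meaningful on persisted, sorted output.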
    
    Thanks for hearing me out! :blush: 
    
    ## How was this patch tested?
    
    New unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/darabos/spark darabos-spark-20144

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22673.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22673
    
----
commit 58cc4daa887482273487ea678d032aab2f5d36e0
Author: Daniel Darabos <darabos.daniel@...>
Date:   2018-10-08T15:31:23Z

    Allow reading files in order with spark.sql.files.allowReordering=false.

----

