GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/19387

    [SPARK-22160][SQL] Allow changing sample points per partition in range shuffle exchange

    ## What changes were proposed in this pull request?
    Spark's RangePartitioner hard-codes the number of sampling points per partition to 20, which is sometimes too low. This ticket makes it configurable via spark.sql.execution.rangeExchange.sampleSizePerPartition and raises the default in Spark SQL to 100.
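    As a rough illustration (not part of this patch), a user could raise the sample size before running a query that plans a range shuffle exchange, such as a global sort. The master URL, sample value of 300, and output path below are made up for the example; only the config key comes from this change:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .master("local[*]")                    // illustrative local master
          .appName("range-exchange-sampling")
          .getOrCreate()

        // Raise the per-partition sample size used when computing range
        // boundaries for a range shuffle exchange (default after this patch: 100).
        spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 300L)

        // A global ORDER BY requires range partitioning, so the sort boundaries
        // are now derived from 300 sampled rows per input partition.
        spark.range(0, 1000000).toDF("id")
          .orderBy("id")
          .write.mode("overwrite").parquet("/tmp/sorted_ids")  // hypothetical output path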
    
    ## How was this patch tested?
    Added a pretty sophisticated test based on a chi-square test ...
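    
    For reference, a minimal sketch of the kind of check such a test could perform (not the actual test added in this patch): range-partition a pair RDD and compute a chi-square goodness-of-fit statistic of the observed partition sizes against a uniform split. The dataset, partition count, and threshold below are illustrative:

        import org.apache.spark.RangePartitioner
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().master("local[4]").appName("range-skew-check").getOrCreate()
        val sc = spark.sparkContext

        // RangePartitioner samples keys from the input RDD to pick range boundaries.
        val data = sc.parallelize(1 to 100000).map(i => (i, i))
        val numPartitions = 10
        val partitioned = data.partitionBy(new RangePartitioner(numPartitions, data))

        // Observed row counts per partition after the range shuffle.
        val observed = partitioned.mapPartitions(it => Iterator(it.size.toLong)).collect()
        val expected = observed.sum.toDouble / observed.length

        // Chi-square goodness-of-fit statistic against a perfectly uniform split;
        // a large value means the sampled boundaries produced skewed partitions.
        val chiSquare = observed.map(o => math.pow(o - expected, 2) / expected).sum
        assert(chiSquare < 100.0, s"partitions look skewed: ${observed.mkString(",")}")  // illustrative threshold

        spark.stop()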


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-22160

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19387.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19387
    
----
commit 843721b38e0a2385253053d475a855161dbc451c
Author: Reynold Xin <[email protected]>
Date:   2017-09-28T21:36:34Z

    [SPARK-22160][SQL] Allow changing sample points per partition in range shuffle exchange
    
    (cherry picked from commit 8e51ae52b6d54ed46a3441bbb83a8e93ba214410)
    Signed-off-by: Reynold Xin <[email protected]>

commit b46c92bf73b486ebc494b44be3c392f4bcd0a7c9
Author: Reynold Xin <[email protected]>
Date:   2017-09-28T22:51:04Z

    Add a test

----

