GitHub user JulienPeloton opened a pull request:

    https://github.com/apache/spark/pull/23025

    [SPARK-26024][SQL]: Update documentation for repartitionByRange

    Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024),
    I noticed that the number of elements in each partition produced by
    `df.repartitionByRange` can vary between runs for the same setup:
    
    ```scala
    import scala.util.Random
    import org.apache.spark.sql.functions.col
    import spark.implicits._  // provides .toDF and the Int encoder for mapPartitions (spark-shell session)
    
    // Shuffle the numbers from 0 to 1000 and make a single-column DataFrame
    val df = Random.shuffle(0.to(1000)).toDF("val")
    
    // Repartition it by range into 3 partitions, count the number of elements
    // in each partition, and repeat several times with the exact same setup
    for (i <- 0 to 9) {
      val counts = df.repartitionByRange(3, col("val"))
        .mapPartitions(part => Iterator(part.size))
        .collect()
      println(counts.toList)
    }
    // -> the number of elements in each partition varies from run to run
    ```
    
    This is expected: for performance reasons, this method uses sampling to
    estimate the range boundaries (by default 100 samples per partition, controlled
    by `spark.sql.execution.rangeExchange.sampleSizePerPartition`). Since different
    runs can draw different samples, the estimated boundaries, and hence the number
    of elements per partition, may differ. However, the documentation did not
    mention this at all, which leads to misunderstanding.
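
    To see why this happens, here is a small, self-contained Scala sketch of
    sampling-based boundary estimation (a toy model, not Spark's actual
    implementation): each run draws a different sample, so the estimated split
    points, and with them the per-partition counts, shift slightly.
    
    ```scala
    // Toy illustration of sampling-based range boundaries (not Spark's real code).
    import scala.util.Random
    
    val data = (0 to 1000).toList
    
    for (_ <- 0 to 4) {
      // Draw a fresh 100-element sample and derive 2 split points for 3 ranges
      val sample = Random.shuffle(data).take(100).sorted
      val bounds = Seq(sample(sample.size / 3), sample(2 * sample.size / 3))
    
      // Count how many elements of the full dataset fall into each range
      val counts = Seq(
        data.count(_ < bounds(0)),
        data.count(v => v >= bounds(0) && v < bounds(1)),
        data.count(_ >= bounds(1))
      )
      println(counts)  // different boundaries -> different counts on each run
    }
    ```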
    
    ## What changes were proposed in this pull request?
    
    Update the documentation (Spark & PySpark) to mention the impact of
    `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting
    partitioned DataFrame.
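
    For completeness, the sample size is a runtime SQL conf, so users who want
    more stable partition sizes can raise it on an existing session, at the cost
    of a slightly more expensive sampling step. A minimal sketch, reusing the
    `spark` session, `df` and `col` from the snippet above:
    
    ```scala
    // Default is 100 samples per partition; a larger sample reduces (but does
    // not eliminate) the run-to-run variance of the partition sizes.
    spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 1000L)
    
    val counts = df.repartitionByRange(3, col("val"))
      .mapPartitions(part => Iterator(part.size))
      .collect()
    println(counts.toList)
    ```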

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JulienPeloton/spark SPARK-26024

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23025
    
----
commit b47b6d0eb207021dbde38c35108fec0e39a64332
Author: Julien <peloton@...>
Date:   2018-11-13T20:49:53Z

    Update documentation on repartitionByRange according to SPARK-26024 (Spark)

commit 5a50282959c065f9797ce075239c30edeece4fbe
Author: Julien <peloton@...>
Date:   2018-11-13T20:50:38Z

    Update documentation on repartitionByRange according to SPARK-26024 (PySpark)

----

