GitHub user JulienPeloton opened a pull request: https://github.com/apache/spark/pull/23025
[SPARK-26024][SQL]: Update documentation for repartitionByRange

Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed that the number of elements in each partition produced by `df.repartitionByRange` can vary between runs of the same setup:

```scala
import scala.util.Random
import org.apache.spark.sql.functions.col
import spark.implicits._

// Shuffle the numbers from 0 to 1000 and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition into 3 range partitions, count the elements in each
// partition, and collect the counts. Repeat ten times.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies between runs
```

This is expected: for performance reasons, this method uses sampling to estimate the range boundaries (with a default sample size of 100 per partition). Hence the output may not be consistent, since sampling can return different values from run to run. However, the documentation did not mention this at all, leading to misunderstanding.

## What changes were proposed in this pull request?

Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.
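For readers who want more stable partition boundaries, a minimal sketch of tuning the configuration mentioned above — assuming a running SparkSession named `spark` and the `df` DataFrame from the example (the chosen value 1000 is illustrative, not a recommendation):

```scala
// Raising the per-partition sample size makes the estimated range
// boundaries more stable across runs, at the cost of sampling more rows.
// Default is 100 (as noted above).
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "1000")

// Subsequent range repartitions sample up to 1000 rows per partition
// when estimating the boundaries.
val repartitioned = df.repartitionByRange(3, col("val"))
```

Note that sampling still introduces some nondeterminism; a larger sample only reduces the variance of the estimated boundaries, it does not eliminate it.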
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JulienPeloton/spark SPARK-26024

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23025.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23025

----

commit b47b6d0eb207021dbde38c35108fec0e39a64332
Author: Julien <peloton@...>
Date:   2018-11-13T20:49:53Z

    Update documentation on repartitionByRange according to SPARK-26024 (Spark)

commit 5a50282959c065f9797ce075239c30edeece4fbe
Author: Julien <peloton@...>
Date:   2018-11-13T20:50:38Z

    Update documentation on repartitionByRange according to SPARK-26024 (PySpark)

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org