Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21291#discussion_r188979880

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala ---
```
@@ -39,7 +39,9 @@ class ConfigBehaviorSuite extends QueryTest with SharedSQLContext {
     def computeChiSquareTest(): Double = {
       val n = 10000
       // Trigger a sort
-      val data = spark.range(0, n, 1, 1).sort('id.desc)
+      // Range has range partitioning in its output now. To have a range shuffle, we
+      // need to run a repartition first.
+      val data = spark.range(0, n, 1, 1).repartition(10).sort('id.desc)
```
--- End diff --

This is a good point. Here are the query plan and partition sizes for `spark.range(0, n, 1, 1).repartition(10).sort('id.desc)`, with `SQLConf.RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION` set to 1:

```
== Physical Plan ==
*(2) Sort [id#15L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(id#15L DESC NULLS LAST, 4)
   +- Exchange RoundRobinPartitioning(10)
      +- *(1) Range (0, 10000, step=1, splits=1)

(1666, 3766, 2003, 2565)
```

And for `spark.range(0, n, 1, 10).sort('id.desc)`:

```
== Physical Plan ==
*(2) Sort [id#13L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(id#13L DESC NULLS LAST, 4)
   +- *(1) Range (0, 10000, step=1, splits=10)

(2835, 2469, 2362, 2334)
```

Because `repartition` shuffles data with `RoundRobinPartitioning`, I guess it degrades the sampling done by the range exchange. Without `repartition`, `Range`'s output is already range partitioned, so the sampling produces better range boundaries.
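To make the difference concrete, here is a minimal, self-contained sketch (plain Scala, not Spark, and not the exact statistic `computeChiSquareTest` uses) that applies the Pearson chi-square idea to the two sets of partition sizes above, against a uniform expectation of 2500 rows per partition. A larger statistic means less uniform partitions; the `ChiSquareSketch` object and its `chiSquare` helper are names I made up for this illustration:

```scala
// Hypothetical sketch: Pearson chi-square statistic over observed partition
// sizes, assuming a uniform expected size of total / numPartitions.
object ChiSquareSketch {
  def chiSquare(observed: Seq[Long], total: Long): Double = {
    val expected = total.toDouble / observed.size
    // sum over partitions of (observed - expected)^2 / expected
    observed.map(o => math.pow(o - expected, 2) / expected).sum
  }

  def main(args: Array[String]): Unit = {
    // Partition sizes reported above for repartition(10).sort('id.desc)
    val withRepartition = Seq(1666L, 3766L, 2003L, 2565L)
    // Partition sizes reported above for range(0, n, 1, 10).sort('id.desc)
    val direct = Seq(2835L, 2469L, 2362L, 2334L)
    println(f"with repartition: ${chiSquare(withRepartition, 10000)}%.2f")
    println(f"direct range:     ${chiSquare(direct, 10000)}%.2f")
  }
}
```

On these numbers the round-robin case yields a statistic more than an order of magnitude larger than the direct-range case, which matches the claim that the round-robin shuffle worsens the sampling behind the range boundaries.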