Github user sddyljsx commented on the issue:
https://github.com/apache/spark/pull/21859
This optimization is only for SQL, but other places also use
RangePartitioner. What it can affect other places?
The failed UTs are caused by
```
else if (sampleCacheEnabled && numItems == numSampled) {
// get the sampled data
sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) =>
{
total ++ sample._3
})
Array.empty
}
```
the RangePartitioner's rangeBounds will be empty. I think the rangeBounds
will have no use if the optimization works, so an empty array is returned. But
it may cause errors, so I change the code, and always get the rangeBounds. By
this way, the only diff of the new RangePartitioner is that it stores an extra
smapledArray which won't effect other places.
what's more,
```
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true,
val samplePointsPerPartitionHint: Int = 20,
val sampleCacheEnabled: Boolean = false)
```
the default value of the sampleCacheEnabled is false, only in this place it
is true
```
ShuffleExchangeExec.scala
new RangePartitioner(
numPartitions,
rddForSampling,
ascending = true,
samplePointsPerPartitionHint =
SQLConf.get.rangeExchangeSampleSizePerPartition,
sampleCacheEnabled = SQLConf.get.rangeExchangeSampleCacheEnabled)
```
it will be safe.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]