GitHub user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/19764
@caneGuy adding the second ordering makes the sort much more fine-grained: the range partitioner computes its range boundaries using all of the ordering expressions, so the boundaries will probably differ from the single-column case.
Let's work through an example. Say we want to range partition `col1` into 5 partitions; `col2` is added as a secondary ordering by your PR. To keep the example simple, `col1` has only 3 distinct values. The following table shows how the rows can map to partitions in the current situation and in the new situation:
| col1 | col2 | current partitioning | PR partitioning |
| ---- | ---- | ------------------- | --------------- |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 2 | 0 | 2 | 3 |
| 2 | 1 | 2 | 4 |
Notice how adding the secondary ordering changes the partitioning. This breaks the contract of the range partitioner, and will lead to invalid results if we use this for anything besides a top-level global ordering (i.e. `select * from tbl_x order by col1`).
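A minimal sketch of the effect, using `org.apache.spark.RangePartitioner` directly on the rows from the table; the data, the partition count, and all variable names here are illustrative, and the exact boundaries depend on the partitioner's sampling:

```scala
import org.apache.spark.RangePartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("range-partition-demo").getOrCreate()
val sc = spark.sparkContext

// (col1, col2) rows from the example table above.
val rows = Seq((0, 0), (1, 0), (1, 0), (1, 1), (2, 0), (2, 1))

// Current behavior: boundaries are computed on col1 alone, so all rows with
// the same col1 value are guaranteed to land in the same partition.
val byCol1 = sc.parallelize(rows)
val current = new RangePartitioner(5, byCol1)

// PR behavior: the composite (col1, col2) key participates in boundary
// sampling, so rows sharing a col1 value can end up in different partitions.
val byBoth = sc.parallelize(rows).map { case (c1, c2) => ((c1, c2), ()) }
val proposed = new RangePartitioner(5, byBoth)

rows.distinct.foreach { case (c1, c2) =>
  println(s"($c1, $c2) -> current=${current.getPartition(c1)}, " +
          s"pr=${proposed.getPartition((c1, c2))}")
}
```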
If you only want to improve the performance of top-level global orderings, then it is better to write an optimizer rule/planning strategy that explicitly adds the secondary ordering columns to the ordering. The only caveat is that this can hurt performance in the non-skewed case: Spark uses radix sort for simple sorts, which is generally faster than the TimSort it uses for more complex orderings.
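To illustrate the caveat (a hedged sketch; the data and column names are made up, and `spark.sql.sort.enableRadixSort` is the SQL config that gates radix sorting):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((0, 0), (1, 0), (1, 1), (2, 0), (2, 1)).toDF("col1", "col2")

// Single fixed-width key: the sort prefix fully orders the rows, so the
// radix sort path can be used.
df.orderBy($"col1").collect()

// With a secondary column the prefix no longer decides ties, so the sort
// falls back to comparison-based (TimSort-style) sorting.
df.orderBy($"col1", $"col2").collect()

// The radix path can be toggled off when benchmarking the difference.
spark.conf.set("spark.sql.sort.enableRadixSort", "false")
```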