Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/19764
  
    @caneGuy adding the secondary ordering makes the sort much more fine-grained, 
which means the range partitioner will probably compute different range 
boundaries once it uses all of the ordering expressions.
    
    Let's work through an example. Say we want to range partition on `col1` 
into 5 partitions; your PR adds `col2` as a secondary ordering. To keep the 
example simple, `col1` has only 3 distinct values. The following table shows 
how the rows map to partitions in the current situation and in the new 
situation:
    
    | col1 | col2 | current partitioning | PR partitioning |
    | ---- | ---- | -------------------- | --------------- |
    | 0    | 0    | 0                    | 0               |
    | 1    | 0    | 1                    | 1               |
    | 1    | 0    | 1                    | 1               |
    | 1    | 1    | 1                    | 2               |
    | 2    | 0    | 2                    | 3               |
    | 2    | 1    | 2                    | 4               |
    
    Notice how adding the secondary ordering changes the partitioning. This 
breaks the contract of the range partitioner, and will lead to invalid results 
if we use it for anything besides a top-level global ordering (i.e. `select * 
from tbl_x order by col1`).
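
    The table above can be reproduced with a rough Python sketch. This is not 
Spark's actual `RangePartitioner` (which samples the data to pick boundaries); 
it simply uses the sorted distinct keys as range boundaries, which is enough to 
show how adding `col2` to the key shifts the partition assignment:

```python
from bisect import bisect_left

# The six example rows from the table: (col1, col2).
rows = [(0, 0), (1, 0), (1, 0), (1, 1), (2, 0), (2, 1)]

def range_partition(rows, key):
    """Assign each row a partition id via range boundaries over `key`.

    Toy model: boundaries are just the sorted distinct key values.
    The real RangePartitioner derives them from a sample instead.
    """
    boundaries = sorted(set(key(r) for r in rows))
    return [bisect_left(boundaries, key(r)) for r in rows]

# Partitioning on col1 alone: 3 distinct keys, so 3 partitions are used.
current = range_partition(rows, lambda r: r[0])   # [0, 1, 1, 1, 2, 2]

# Partitioning on (col1, col2): 5 distinct keys, so 5 partitions are used.
pr = range_partition(rows, lambda r: r)           # [0, 1, 1, 2, 3, 4]
```

    The two assignments match the "current partitioning" and "PR partitioning" 
columns, so rows with the same `col1` value no longer land in the same 
partition once the secondary key participates in the boundaries.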
    
    If you only want to improve the performance of top-level global orderings, 
then it is better to write an optimizer rule/planning strategy that explicitly 
adds the secondary ordering columns to the ordering. The only caveat here might 
be that this can hurt performance in the non-skewed case: Spark uses radix 
sorting for simple sorts, which is generally faster than the TimSort used for 
complex cases.
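
    The reason that approach is safe is that appending tie-breaker columns to 
the sort order only refines it: the primary column remains non-decreasing in 
the result, so the query's requested ordering is still satisfied. A minimal 
sketch of that invariant (plain Python `sorted`, standing in for the physical 
sort):

```python
# Arbitrary (col1, col2) rows, deliberately out of order.
rows = [(2, 1), (1, 0), (2, 0), (1, 1), (0, 0)]

# What the query asked for: order by col1 only.
by_primary = sorted(rows, key=lambda r: r[0])

# What the rule would produce: order by (col1, col2).
by_composite = sorted(rows, key=lambda r: (r[0], r[1]))

# The composite order is a refinement of the primary order:
# col1 is still globally non-decreasing.
assert [r[0] for r in by_composite] == [r[0] for r in by_primary]
```

    So rewriting `order by col1` into `order by col1, col2` during planning 
keeps the answer correct while making ties cheaper to break at runtime; the 
trade-off is only the radix-vs-TimSort cost noted above.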
