[GitHub] [arrow-datafusion] jaylmiller commented on issue #5230: Use Arrow Row Format in SortExec

via GitHub Tue, 07 Mar 2023 08:06:26 -0800


jaylmiller commented on issue #5230:
URL: 
https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1458428345


   My current thinking is that since the single batch scenario is a special 
case with its own code path:
   
   
https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_plan/sorts/sort.rs#L286-L294
   
   and based on findings that row format sorting can often perform worse on 
single batches, 
   it seems that the performance benefits of the row encoding are gained during 
the step of 
   the algorithm that merges the mem sorted batches. One possible reason that 
row encoding perf benefits are seen
   when a merge is performed, is that we can use a sorting algorithm that 
benefits from the 
   fact that the data is concat'd sorted sequences (in rust, regular sort is a 
mergesort and unstable sort is a quicksort).
   Whereas without the row encoding, we use `lexsort_to_indices` which doesn't 
let us benefit
   from the data being sorted sequences.
   
   So I'm thinking if we gate row encoding usage based on whether or not that 
merge will happen, 
   we can keep the perf advantages and remove these regressions. Here's my 
current bench results:
   ```
   group                                                     main-sort          
                      rows-sort
   -----                                                     ---------          
                      ---------
   sort f64                                                  1.00     
10.8±0.23ms        ? ?/sec      1.04     11.2±0.93ms        ? ?/sec
   sort f64 preserve partitioning                            1.00      
4.0±0.27ms        ? ?/sec      1.04      4.1±0.28ms        ? ?/sec
   sort i64                                                  1.00      
9.5±0.55ms        ? ?/sec      1.09     10.3±0.74ms        ? ?/sec
   sort i64 preserve partitioning                            1.00      
3.3±0.10ms        ? ?/sec      1.06      3.5±0.13ms        ? ?/sec
   sort mixed tuple                                          1.28     
28.3±3.35ms        ? ?/sec      1.00     22.2±1.60ms        ? ?/sec
   sort mixed tuple preserve partitioning                    1.00      
3.6±0.17ms        ? ?/sec      1.15      4.1±1.09ms        ? ?/sec
   sort mixed utf8 dictionary tuple                          2.84     
52.7±8.27ms        ? ?/sec      1.00     18.6±1.29ms        ? ?/sec
   sort mixed utf8 dictionary tuple preserve partitioning    1.02      
4.2±0.92ms        ? ?/sec      1.00      4.1±0.55ms        ? ?/sec
   sort utf8 dictionary                                      1.00      
3.7±0.21ms        ? ?/sec      1.04      3.9±0.33ms        ? ?/sec
   sort utf8 dictionary preserve partitioning                1.00  
1487.2±1444.67µs        ? ?/sec    1.01  1502.8±315.79µs        ? ?/sec
   sort utf8 dictionary tuple                                3.26    
57.0±11.35ms        ? ?/sec      1.00     17.5±2.08ms        ? ?/sec
   sort utf8 dictionary tuple preserve partitioning          1.13      
4.1±1.08ms        ? ?/sec      1.00      3.6±0.52ms        ? ?/sec
   sort utf8 high cardinality                                1.01     
28.0±3.70ms        ? ?/sec      1.00     27.6±3.81ms        ? ?/sec
   sort utf8 high cardinality preserve partitioning          1.00     
11.1±1.48ms        ? ?/sec      1.21     13.5±3.38ms        ? ?/sec
   sort utf8 low cardinality                                 1.00     
15.3±5.08ms        ? ?/sec      1.10     16.9±6.20ms        ? ?/sec
   sort utf8 low cardinality preserve partitioning           1.03      
8.1±2.21ms        ? ?/sec      1.00      7.8±1.75ms        ? ?/sec
   sort utf8 tuple                                           1.96     
56.8±8.36ms        ? ?/sec      1.00     29.0±4.82ms        ? ?/sec
   sort utf8 tuple preserve partitioning                     1.02      
6.7±0.95ms        ? ?/sec      1.00      6.5±0.46ms        ? ?/sec
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] jaylmiller commented on issue #5230: Use Arrow Row Format in SortExec

Reply via email to