jaylmiller commented on issue #5230:
URL: 
https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1453611831

   > > like > 512
   > 
   > How common are such batches in practice? I guess I'm wondering if the 
added complexity is justified for what is effectively a degenerate case that 
will cause issues far beyond just for sort?
   > 
   > _Btw DynComparator has known issues w.r.t sorting nulls, and I had hoped 
to eventually deprecate and remove it_ - 
[apache/arrow-rs#2687](https://github.com/apache/arrow-rs/issues/2687)
   
   No, 512 is way too small, @tustvold. In the sort bench we see a regression 
when the `execute` call sorts a single batch of 12500 rows (the total benchmark 
input is 100000 rows, split across 8 partitions). This happens when 
partitioning is preserved, since each partition is sorted separately. When 
partitioning is not preserved and all batches are sorted together, we see 
significant perf improvements. Likewise, when partitioning is preserved but the 
input data is all skewed to a single partition, we see the same perf 
improvement (as expected). Here are the bench results for each of those 
scenarios:
   ```
   group                                                                        main-sort                                rows-sort
   -----                                                                        ---------                                ---------
   sort mixed tuple                                                             1.00     29.5±2.83ms        ? ?/sec      1.04     30.5±3.23ms        ? ?/sec
   sort mixed tuple preserve partitioning                                       1.00      4.7±0.94ms        ? ?/sec      1.52      7.1±0.64ms        ? ?/sec
   sort mixed tuple preserve partitioning data skewed to first                  1.00     30.6±4.78ms        ? ?/sec      1.00     30.6±6.66ms        ? ?/sec
   sort mixed utf8 dictionary tuple                                             2.60    60.8±13.04ms        ? ?/sec      1.00     23.4±0.93ms        ? ?/sec
   sort mixed utf8 dictionary tuple preserve partitioning                       1.00      4.5±1.27ms        ? ?/sec      1.11      5.1±0.40ms        ? ?/sec
   sort mixed utf8 dictionary tuple preserve partitioning data skewed to first  2.24     54.0±4.22ms        ? ?/sec      1.00     24.1±2.17ms        ? ?/sec
   sort utf8 dictionary tuple                                                   2.32     54.7±7.35ms        ? ?/sec      1.00     23.6±3.48ms        ? ?/sec
   sort utf8 dictionary tuple preserve partitioning                             1.00      3.7±0.37ms        ? ?/sec      1.24      4.6±0.38ms        ? ?/sec
   sort utf8 dictionary tuple preserve partitioning data skewed to first        2.50     54.1±5.52ms        ? ?/sec      1.00     21.6±0.65ms        ? ?/sec
   sort utf8 tuple                                                              1.79    62.5±13.08ms        ? ?/sec      1.00     35.0±1.62ms        ? ?/sec
   sort utf8 tuple preserve partitioning                                        1.00      7.3±0.79ms        ? ?/sec      1.17      8.6±0.74ms        ? ?/sec
   sort utf8 tuple preserve partitioning data skewed to first                   1.54     54.5±5.11ms        ? ?/sec      1.00     35.4±2.18ms        ? ?/sec
   ```
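
For anyone reading along, here is a rough sketch of the row counts each sort call sees in the three scenarios, assuming the benchmark spreads rows evenly across partitions unless the data is skewed. The helper names are hypothetical and are not taken from the benchmark code; they only illustrate the arithmetic behind the 12500-row batches mentioned above.

```rust
/// Hypothetical helper (not from the benchmark): number of rows each
/// partition holds when `total_rows` are spread as evenly as possible.
fn even_partition_sizes(total_rows: usize, partitions: usize) -> Vec<usize> {
    if partitions == 0 {
        return Vec::new();
    }
    let base = total_rows / partitions;
    let remainder = total_rows % partitions;
    // The first `remainder` partitions take one extra row each.
    (0..partitions)
        .map(|i| base + usize::from(i < remainder))
        .collect()
}

/// Hypothetical helper: every row lands in the first partition,
/// mirroring the "data skewed to first" scenarios.
fn skewed_to_first_sizes(total_rows: usize, partitions: usize) -> Vec<usize> {
    let mut sizes = vec![0; partitions];
    if let Some(first) = sizes.first_mut() {
        *first = total_rows;
    }
    sizes
}
```

With 100000 rows and 8 partitions, `even_partition_sizes` gives eight partitions of 12500 rows, so preserving partitioning means eight independent small sorts. `skewed_to_first_sizes` puts all 100000 rows in one partition, which is effectively the same workload as sorting everything together.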


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

