jaylmiller commented on issue #5230:
URL: 
https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454843441

   I ran some experiments investigating how batch size affects performance when
doing multi-column sorts on a single record batch.
   
   <img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/mixed-tuple.png">
   <img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/utf8-tuple.png">
   <img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/dictionary-tuple.png">
   <img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/mixed-dictionary-tuple.png">
   
   So the batch size theory seems wrong, but these results do show why the
"preserve partitioning" cases are regressing. What's interesting is that,
although single-batch sorting with the row format is actually slower, we still
get a significant speedup when more than one batch is being sorted 🤔. For
example, the benchmark comparison for utf8-tuple:
   ```
   group                                    main-sort                                rows-sort
   -----                                    ---------                                ---------
   sort utf8 tuple                          1.79    62.5±13.08ms        ? ?/sec      1.00     35.0±1.62ms        ? ?/sec
   sort utf8 tuple preserve partitioning    1.00      7.3±0.79ms        ? ?/sec      1.17      8.6±0.74ms        ? ?/sec
   ```
   
   Methodology: https://github.com/jaylmiller/inspect-arrow-sort. The actual
sorting is [right here](https://github.com/jaylmiller/inspect-arrow-sort/blob/main/src/lib.rs#L23-L75)
and is pretty much entirely lifted from the PR.
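
For readers who don't have the repo open, here is a rough, std-only sketch of the two strategies being compared: a column-by-column comparator sort versus sorting on pre-encoded, byte-comparable row keys. This is not the code from the PR or the arrow-rs row format. The `Tuple` type, the function names, and the encoding (NUL-terminated string bytes followed by a sign-flipped big-endian integer) are assumptions made up for illustration only.

```rust
// Hypothetical stand-in for one row of a two-column (utf8, i64) batch.
// Real record batches are columnar; this only illustrates the comparison logic.
type Tuple = (String, i64);

/// Comparator-based multi-column sort: every comparison walks the columns in order.
fn sort_indices_lexicographic(rows: &[Tuple]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&a, &b| {
        rows[a]
            .0
            .cmp(&rows[b].0)
            .then_with(|| rows[a].1.cmp(&rows[b].1))
    });
    idx
}

/// Encode one row into bytes whose bytewise order matches the tuple order.
/// Assumption: strings contain no interior NUL bytes.
fn encode_row(row: &Tuple) -> Vec<u8> {
    let mut key = Vec::with_capacity(row.0.len() + 1 + 8);
    // String column: raw bytes plus a 0x00 terminator so a prefix sorts first.
    key.extend_from_slice(row.0.as_bytes());
    key.push(0);
    // i64 column: flip the sign bit and store big-endian so negatives sort first.
    key.extend_from_slice(&((row.1 as u64) ^ (1 << 63)).to_be_bytes());
    key
}

/// Row-encoded sort: pay the encoding cost once, then compare plain byte slices.
fn sort_indices_row_encoded(rows: &[Tuple]) -> Vec<usize> {
    let keys: Vec<Vec<u8>> = rows.iter().map(encode_row).collect();
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&a, &b| keys[a].cmp(&keys[b]));
    idx
}
```

Both functions use a stable sort and should return the same index permutation for the same input; the difference is only in where the per-comparison work happens.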
   
   
   
   
   

