[GitHub] [arrow-datafusion] jaylmiller commented on pull request #5292: use row encoding for SortExec

via GitHub Tue, 14 Mar 2023 08:33:14 -0700


jaylmiller commented on PR #5292:
URL: 
https://github.com/apache/arrow-datafusion/pull/5292#issuecomment-1468327766


   Coding-wise everything is finished and the code is ready for review. In terms 
of bench results, though, I'm not 100% confident yet. 
   
   The sort micro-benchmarks are looking pretty good: significant improvements in 
cases where row encoding is actually used, and minor regressions, mostly within 
error bars, in cases without row encoding. More experienced contributors would 
know better how significant these regressions actually are. The results are 
reposted below: 
   
   ```
   group                                                     main-sort                                rows-sort
   -----                                                     ---------                                ---------
   sort f64                                                  1.00     10.8±0.23ms        ? ?/sec      1.04     11.2±0.93ms        ? ?/sec
   sort f64 preserve partitioning                            1.00      4.0±0.27ms        ? ?/sec      1.04      4.1±0.28ms        ? ?/sec
   sort i64                                                  1.00      9.5±0.55ms        ? ?/sec      1.09     10.3±0.74ms        ? ?/sec
   sort i64 preserve partitioning                            1.00      3.3±0.10ms        ? ?/sec      1.06      3.5±0.13ms        ? ?/sec
   sort mixed tuple                                          1.28     28.3±3.35ms        ? ?/sec      1.00     22.2±1.60ms        ? ?/sec
   sort mixed tuple preserve partitioning                    1.00      3.6±0.17ms        ? ?/sec      1.15      4.1±1.09ms        ? ?/sec
   sort mixed utf8 dictionary tuple                          2.84     52.7±8.27ms        ? ?/sec      1.00     18.6±1.29ms        ? ?/sec
   sort mixed utf8 dictionary tuple preserve partitioning    1.02      4.2±0.92ms        ? ?/sec      1.00      4.1±0.55ms        ? ?/sec
   sort utf8 dictionary                                      1.00      3.7±0.21ms        ? ?/sec      1.04      3.9±0.33ms        ? ?/sec
   sort utf8 dictionary preserve partitioning                1.00  1487.2±1444.67µs        ? ?/sec    1.01  1502.8±315.79µs        ? ?/sec
   sort utf8 dictionary tuple                                3.26    57.0±11.35ms        ? ?/sec      1.00     17.5±2.08ms        ? ?/sec
   sort utf8 dictionary tuple preserve partitioning          1.13      4.1±1.08ms        ? ?/sec      1.00      3.6±0.52ms        ? ?/sec
   sort utf8 high cardinality                                1.01     28.0±3.70ms        ? ?/sec      1.00     27.6±3.81ms        ? ?/sec
   sort utf8 high cardinality preserve partitioning          1.00     11.1±1.48ms        ? ?/sec      1.21     13.5±3.38ms        ? ?/sec
   sort utf8 low cardinality                                 1.00     15.3±5.08ms        ? ?/sec      1.10     16.9±6.20ms        ? ?/sec
   sort utf8 low cardinality preserve partitioning           1.03      8.1±2.21ms        ? ?/sec      1.00      7.8±1.75ms        ? ?/sec
   sort utf8 tuple                                           1.96     56.8±8.36ms        ? ?/sec      1.00     29.0±4.82ms        ? ?/sec
   sort utf8 tuple preserve partitioning                     1.02      6.7±0.95ms        ? ?/sec      1.00      6.5±0.46ms        ? ?/sec
   ```
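
For anyone who wants to sanity-check the "within error bars" reading, here is a rough sketch that recomputes the main/rows ratio for a few rows of the table and checks whether the two `mean ± spread` intervals overlap. The names `RESULTS`, `speedup` and `bars_overlap` are made up for this illustration, and treating the `±` figure as a symmetric error bar is an assumption, not something the benchmark tooling guarantees.

```python
# (mean_ms, plus_minus_ms) pairs for main-sort and rows-sort,
# copied by hand from a few rows of the table above.
RESULTS = {
    "sort i64": ((9.5, 0.55), (10.3, 0.74)),
    "sort mixed tuple": ((28.3, 3.35), (22.2, 1.60)),
    "sort utf8 dictionary tuple": ((57.0, 11.35), (17.5, 2.08)),
    "sort utf8 tuple": ((56.8, 8.36), (29.0, 4.82)),
    "sort utf8 high cardinality preserve partitioning": ((11.1, 1.48), (13.5, 3.38)),
}


def speedup(main, rows):
    """Ratio of mean runtimes; above 1.0 means rows-sort is faster."""
    return main[0] / rows[0]


def bars_overlap(main, rows):
    """True if the [mean - spread, mean + spread] intervals intersect.

    Assumption: the +/- value is read as a symmetric error bar.
    """
    main_lo, main_hi = main[0] - main[1], main[0] + main[1]
    rows_lo, rows_hi = rows[0] - rows[1], rows[0] + rows[1]
    return main_lo <= rows_hi and rows_lo <= main_hi


if __name__ == "__main__":
    for name, (main, rows) in RESULTS.items():
        verdict = "overlap" if bars_overlap(main, rows) else "no overlap"
        print(f"{name}: main/rows = {speedup(main, rows):.2f} ({verdict})")
```

Overlapping intervals are only a crude signal; they do not replace a proper significance test, but they give a quick way to separate the large tuple-sort wins from the small single-column differences.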
   
   
   In summary, I'd like to get opinions on these micro-benchmark results. Ideally, 
we can then also run the e2e bench comparisons (#5561) on `tpch` and `parquet` 
to get a bit more data on whether this change is worth merging.
   


