jorgecarleitao opened a new pull request #8480:
URL: https://github.com/apache/arrow/pull/8480


   This PR is a proposal to change our iterators over `RecordBatch` to 
`Future<RecordBatch>`, thereby making our compute operations over a single 
batch as the "unit of work".
   
   The rational here is that our expensive operations are over record batches, 
which are the ones that benefit from being split in smaller units (and 
multi-threaded).
   
   This PR also places some `tokio::spawn` on some ops, with the expectation 
that the scheduler can multi-thread them.
   
   The micro-benchmarks are not very indicative as this affects larger sizes, 
but as a rough idea, I get -50% improvement for larger math projections and +5% 
to +20% degradation for aggregations.
   
   The aggregations have a mutex blocking the whole thing, which may explain 
the result.
   
   <details>
   <summary>Benchmarks</summary>
   
   Math
   ```
   sqrt_20_9               time:   [7.6861 ms 7.7328 ms 7.7805 ms]              
        
                           change: [+20.288% +21.552% +22.732%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   
   Benchmarking sqrt_20_12: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 9.4s, enable flat sampling, or reduce sample count to 50.
   sqrt_20_12              time:   [1.7891 ms 1.7954 ms 1.8020 ms]              
          
                           change: [-38.768% -37.852% -36.578%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     6 (6.00%) high mild
     4 (4.00%) high severe
   
   sqrt_22_12              time:   [8.4315 ms 8.5324 ms 8.6348 ms]              
         
                           change: [-39.677% -38.299% -36.893%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   sqrt_22_14              time:   [11.515 ms 11.759 ms 12.054 ms]              
         
                           change: [-48.307% -47.200% -45.854%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     4 (4.00%) high mild
     6 (6.00%) high severe
   ```
   
   Aggregates:
   ```
   aggregate_query_no_group_by 15 12                                            
                                
                           time:   [831.70 us 836.65 us 842.22 us]
                           change: [+8.0888% +9.9290% +11.904%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 10 outliers among 100 measurements (10.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     7 (7.00%) high severe
   
   aggregate_query_group_by 15 12                                               
                             
                           time:   [5.9246 ms 5.9763 ms 6.0367 ms]
                           change: [+3.1496% +4.1417% +5.1472%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 13 outliers among 100 measurements (13.00%)
     6 (6.00%) high mild
     7 (7.00%) high severe
   
   aggregate_query_group_by_with_filter 15 12                                   
                                          
                           time:   [3.4054 ms 3.4322 ms 3.4597 ms]
                           change: [+26.844% +27.870% +28.979%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   ```
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to