jorgecarleitao opened a new pull request #8480:
URL: https://github.com/apache/arrow/pull/8480
This PR is a proposal to change our iterators over `RecordBatch` to
`Future<RecordBatch>`, thereby making our compute operations over a single
batch as the "unit of work".
The rational here is that our expensive operations are over record batches,
which are the ones that benefit from being split in smaller units (and
multi-threaded).
This PR also places some `tokio::spawn` on some ops, with the expectation
that the scheduler can multi-thread them.
The micro-benchmarks are not very indicative as this affects larger sizes,
but as a rough idea, I get -50% improvement for larger math projections and +5%
to +20% degradation for aggregations.
The aggregations have a mutex blocking the whole thing, which may explain
the result.
<details>
<summary>Benchmarks</summary>
Math
```
sqrt_20_9 time: [7.6861 ms 7.7328 ms 7.7805 ms]
change: [+20.288% +21.552% +22.732%] (p = 0.00 <
0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
Benchmarking sqrt_20_12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 9.4s, enable flat sampling, or reduce sample count to 50.
sqrt_20_12 time: [1.7891 ms 1.7954 ms 1.8020 ms]
change: [-38.768% -37.852% -36.578%] (p = 0.00 <
0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
sqrt_22_12 time: [8.4315 ms 8.5324 ms 8.6348 ms]
change: [-39.677% -38.299% -36.893%] (p = 0.00 <
0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
sqrt_22_14 time: [11.515 ms 11.759 ms 12.054 ms]
change: [-48.307% -47.200% -45.854%] (p = 0.00 <
0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
```
Aggregates:
```
aggregate_query_no_group_by 15 12
time: [831.70 us 836.65 us 842.22 us]
change: [+8.0888% +9.9290% +11.904%] (p = 0.00 <
0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
2 (2.00%) high mild
7 (7.00%) high severe
aggregate_query_group_by 15 12
time: [5.9246 ms 5.9763 ms 6.0367 ms]
change: [+3.1496% +4.1417% +5.1472%] (p = 0.00 <
0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe
aggregate_query_group_by_with_filter 15 12
time: [3.4054 ms 3.4322 ms 3.4597 ms]
change: [+26.844% +27.870% +28.979%] (p = 0.00 <
0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
```
</details>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]