crepererum opened a new pull request, #4371:
URL: https://github.com/apache/arrow-datafusion/pull/4371
# Which issue does this PR close?
Closes #3940.
# Rationale for this change
Ensure that users don't run out of memory while performing group-by
operations. This is esp. important for servers or multi-tenant systems.
# What changes are included in this PR?
This is similar to #4202. It includes an additional type `StreamType` so we
can double-check our test setup (namely: is the stream that we request actually
the stream version we want).
# Are these changes tested?
Extended `test_oom`. Also here are the perf results:
```console
❯ cargo bench -p datafusion --bench aggregate_query_sql -- --baseline
issue3940d-pre
[32/5030]
Compiling datafusion v14.0.0
(/home/mneumann/src/arrow-datafusion/datafusion/core)
Compiling parquet-test-utils v0.1.0
(/home/mneumann/src/arrow-datafusion/parquet-test-utils)
Finished bench [optimized] target(s) in 5m 11s
Running benches/aggregate_query_sql.rs
(target/release/deps/aggregate_query_sql-0659981fac849434)
aggregate_query_no_group_by 15 12
time: [686.71 µs 688.17 µs 689.79 µs]
change: [+0.9423% +1.5099% +2.1284%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) low mild
5 (5.00%) high mild
3 (3.00%) high severe
aggregate_query_no_group_by_min_max_f64
time: [637.75 µs 640.91 µs 644.19 µs]
change: [+0.9089% +1.5396% +2.1740%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
6 (6.00%) high mild
1 (1.00%) high severe
aggregate_query_no_group_by_count_distinct_wide
time: [2.5239 ms 2.5437 ms 2.5641 ms]
change: [+1.5365% +2.6581% +3.8220%] (p = 0.00 <
0.05)
Performance has regressed.
Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up
for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 8.7s, enable flat sampling, or reduce sample count to 50.
aggregate_query_no_group_by_count_distinct_narrow
time: [1.7286 ms 1.7392 ms 1.7498 ms]
change: [+1.2376% +2.3389% +3.5532%] (p = 0.00 <
0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
aggregate_query_group_by
time: [2.2890 ms 2.3063 ms 2.3241 ms]
change: [+1.7896% +2.8160% +3.7350%] (p = 0.00 <
0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 5.8s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter
time: [1.1419 ms 1.1444 ms 1.1472 ms]
change: [+1.1563% +1.6416% +2.1195%] (p = 0.00 <
0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
aggregate_query_group_by_u64 15 12
time: [2.3083 ms 2.3237 ms 2.3394 ms]
change: [+1.7416% +2.7403% +3.7301%] (p = 0.00 <
0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for
3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 5.8s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter_u64 15 12
time: [1.1492 ms 1.1530 ms 1.1572 ms]
change: [+0.4978% +1.1567% +1.7562%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
aggregate_query_group_by_u64_multiple_keys
time: [15.070 ms 15.384 ms 15.706 ms]
change: [-2.0263% +0.9477% +4.0997%] (p = 0.55 >
0.05)
No change in performance detected.
aggregate_query_approx_percentile_cont_on_u64
time: [3.8600 ms 3.8963 ms 3.9341 ms]
change: [-0.1229% +1.2647% +2.8308%] (p = 0.09 >
0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
aggregate_query_approx_percentile_cont_on_f32
time: [3.2601 ms 3.2871 ms 3.3147 ms]
change: [-0.5989% +0.7136% +2.0221%] (p = 0.27 >
0.05)
No change in performance detected.
```
I think the regressions (<3%) are within the safety margin of such a crucial
feature (and also within what a laptop could reliable reproduce).
# Are there any user-facing changes?
The V1 group-by op an now emit a `ResourceExhausted` error if it runs out of
memory. Note that the error is kinda nested/wrapped due to #4172.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]