crepererum opened a new pull request, #4371:
URL: https://github.com/apache/arrow-datafusion/pull/4371

   # Which issue does this PR close?
   Closes #3940.
   
   # Rationale for this change
   Ensure that users don't run out of memory while performing group-by 
operations. This is esp. important for servers or multi-tenant systems.
   
   # What changes are included in this PR?
   This is similar to #4202. It includes an additional type `StreamType` so we 
can double-check our test setup (namely: is the stream that we request actually 
the stream version we want).
   
   # Are these changes tested?
   Extended `test_oom`. Also here are the perf results:
   
   ```console
   ❯ cargo bench -p datafusion --bench aggregate_query_sql -- --baseline 
issue3940d-pre                                                                  
                                                                                
                 [32/5030]
      Compiling datafusion v14.0.0 
(/home/mneumann/src/arrow-datafusion/datafusion/core)
      Compiling parquet-test-utils v0.1.0 
(/home/mneumann/src/arrow-datafusion/parquet-test-utils)
       Finished bench [optimized] target(s) in 5m 11s
        Running benches/aggregate_query_sql.rs 
(target/release/deps/aggregate_query_sql-0659981fac849434)
   aggregate_query_no_group_by 15 12
                           time:   [686.71 µs 688.17 µs 689.79 µs]
                           change: [+0.9423% +1.5099% +2.1284%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   Found 11 outliers among 100 measurements (11.00%)
     3 (3.00%) low mild
     5 (5.00%) high mild
     3 (3.00%) high severe
   
   aggregate_query_no_group_by_min_max_f64
                           time:   [637.75 µs 640.91 µs 644.19 µs]
                           change: [+0.9089% +1.5396% +2.1740%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low mild
     6 (6.00%) high mild
     1 (1.00%) high severe
   
   aggregate_query_no_group_by_count_distinct_wide
                           time:   [2.5239 ms 2.5437 ms 2.5641 ms]
                           change: [+1.5365% +2.6581% +3.8220%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   
   Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up 
for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 8.7s, enable flat sampling, or reduce sample count to 50.
   aggregate_query_no_group_by_count_distinct_narrow
                           time:   [1.7286 ms 1.7392 ms 1.7498 ms]
                           change: [+1.2376% +2.3389% +3.5532%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low severe
     1 (1.00%) low mild
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   aggregate_query_group_by
                           time:   [2.2890 ms 2.3063 ms 2.3241 ms]
                           change: [+1.7896% +2.8160% +3.7350%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 5.8s, enable flat sampling, or reduce sample count to 60.
   aggregate_query_group_by_with_filter
                           time:   [1.1419 ms 1.1444 ms 1.1472 ms]
                           change: [+1.1563% +1.6416% +2.1195%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   aggregate_query_group_by_u64 15 12
                           time:   [2.3083 ms 2.3237 ms 2.3394 ms]
                           change: [+1.7416% +2.7403% +3.7301%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for 
3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 5.8s, enable flat sampling, or reduce sample count to 60.
   aggregate_query_group_by_with_filter_u64 15 12
                           time:   [1.1492 ms 1.1530 ms 1.1572 ms]
                           change: [+0.4978% +1.1567% +1.7562%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   Found 9 outliers among 100 measurements (9.00%)
     2 (2.00%) low mild
     5 (5.00%) high mild
     2 (2.00%) high severe
   
   aggregate_query_group_by_u64_multiple_keys
                           time:   [15.070 ms 15.384 ms 15.706 ms]
                           change: [-2.0263% +0.9477% +4.0997%] (p = 0.55 > 
0.05)
                           No change in performance detected.
   
   aggregate_query_approx_percentile_cont_on_u64
                           time:   [3.8600 ms 3.8963 ms 3.9341 ms]
                           change: [-0.1229% +1.2647% +2.8308%] (p = 0.09 > 
0.05)
                           No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   aggregate_query_approx_percentile_cont_on_f32
                           time:   [3.2601 ms 3.2871 ms 3.3147 ms]
                           change: [-0.5989% +0.7136% +2.0221%] (p = 0.27 > 
0.05)
                           No change in performance detected.
   ```
   
   I think the regressions (<3%) are within the safety margin of such a crucial 
feature (and also within what a laptop could reliable reproduce).
   
   # Are there any user-facing changes?
   The V1 group-by op an now emit a `ResourceExhausted` error if it runs out of 
memory. Note that the error is kinda nested/wrapped due to #4172.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to