[GitHub] [arrow-datafusion] crepererum opened a new pull request, #4202: wire memory management into `GroupedHashAggregateStreamV2`

GitBox Mon, 14 Nov 2022 02:36:56 -0800


crepererum opened a new pull request, #4202:
URL: https://github.com/apache/arrow-datafusion/pull/4202


   # Which issue does this PR close?
   Doesn't close, but works towards #3940 (need to migrate V1 as well
   
   # Rationale for this change
   Ensure that users don't run out of memory while performing group-by 
operations. This is esp. important for servers or multi-tenant systems.
   
   # What changes are included in this PR?
   - small clean up regarding async usage (first commit)
   - use a a nested construct (`BoxStream`) for `GroupedHashAggregateStreamV2` 
so we can call into the async memory manager (I thought about NOT doing this 
but I think it's worth to consider because on the long run a group-by can get 
another splillable operation to spill)
   
   # Are these changes tested?
   - new test (`test_oom`)
   - perf results (see below)
   
   Perf results:
   
   ```text
   ❯ cargo bench -p datafusion --bench aggregate_query_sql -- --baseline 
issue3940a-pre
       Finished bench [optimized] target(s) in 0.08s
        Running benches/aggregate_query_sql.rs 
(target/release/deps/aggregate_query_sql-e9e315ab7a06a262)
   aggregate_query_no_group_by 15 12
                           time:   [654.77 µs 655.49 µs 656.29 µs]
                           change: [-1.6711% -1.2910% -0.8435%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   Found 9 outliers among 100 measurements (9.00%)
     1 (1.00%) low mild
     5 (5.00%) high mild
     3 (3.00%) high severe
   
   aggregate_query_no_group_by_min_max_f64
                           time:   [579.93 µs 580.59 µs 581.27 µs]
                           change: [-3.8985% -3.2219% -2.6198%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     1 (1.00%) low severe
     3 (3.00%) low mild
     1 (1.00%) high mild
     4 (4.00%) high severe
   
   aggregate_query_no_group_by_count_distinct_wide
                           time:   [2.4610 ms 2.4801 ms 2.4990 ms]
                           change: [-2.9300% -1.8414% -0.7493%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   
   Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up 
for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 8.4s, enable flat sampling, or reduce sample count to 50.
   aggregate_query_no_group_by_count_distinct_narrow
                           time:   [1.6578 ms 1.6661 ms 1.6743 ms]
                           change: [-4.5391% -3.5033% -2.5050%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     1 (1.00%) low severe
     2 (2.00%) low mild
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   aggregate_query_group_by
                           time:   [2.1767 ms 2.2045 ms 2.2486 ms]
                           change: [-4.1048% -2.5858% -0.3237%] (p = 0.00 < 
0.05)
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 5.5s, enable flat sampling, or reduce sample count to 60.
   aggregate_query_group_by_with_filter
                           time:   [1.0916 ms 1.0927 ms 1.0941 ms]
                           change: [-0.8524% -0.4230% -0.0724%] (p = 0.02 < 
0.05)
                           Change within noise threshold.
   Found 9 outliers among 100 measurements (9.00%)
     2 (2.00%) low severe
     1 (1.00%) low mild
     4 (4.00%) high mild
     2 (2.00%) high severe
   
   aggregate_query_group_by_u64 15 12
                           time:   [2.2108 ms 2.2238 ms 2.2368 ms]
                           change: [-4.2142% -3.2743% -2.3523%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for 
3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 5.5s, enable flat sampling, or reduce sample count to 60.
   aggregate_query_group_by_with_filter_u64 15 12
                           time:   [1.0922 ms 1.0931 ms 1.0940 ms]
                           change: [-0.6872% -0.3192% +0.1193%] (p = 0.12 > 
0.05)
                           No change in performance detected.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) low mild
     4 (4.00%) high severe
   
   aggregate_query_group_by_u64_multiple_keys
                           time:   [14.714 ms 15.023 ms 15.344 ms]
                           change: [-5.8337% -2.7471% +0.2798%] (p = 0.09 > 
0.05)
                           No change in performance detected.
   
   aggregate_query_approx_percentile_cont_on_u64
                           time:   [3.7776 ms 3.8049 ms 3.8329 ms]
                           change: [-4.4977% -3.4230% -2.3282%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   aggregate_query_approx_percentile_cont_on_f32
                           time:   [3.1769 ms 3.1997 ms 3.2230 ms]
                           change: [-4.4664% -3.2597% -2.0955%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   ```
   
   I think the mild improvements are either flux or due to the somewhat
   manual memory allocation pattern.
   
   
   # Are there any user-facing changes?
   The V2 group-by op an now emit a `ResourceExhausted` error if it runs out of 
memory. Note that the error is kinda nested/wrapped due to #4172.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] crepererum opened a new pull request, #4202: wire memory management into `GroupedHashAggregateStreamV2`

Reply via email to