[GitHub] [arrow-datafusion] alamb commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

GitBox Thu, 05 Aug 2021 13:36:35 -0700


alamb commented on issue #790:
URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-893778745



   I got enough of the approach described by @Dandandan in 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-893232614 
working in https://github.com/apache/arrow-datafusion/pull/808/files to run 
some benchmarks.
   
   Results look as follows:
   
   ```
   (arrow_dev) alamb@MacBook-Pro:~/Software/arrow-datafusion$ critcmp gby_new gby_new2 master1 master2
   group                                                gby_new                                 gby_new2                                master1                                 master2
   -----                                                -------                                 --------                                -------                                 -------
   aggregate_query_group_by                             1.01       2.9±0.18ms        ? ?/sec    1.00       2.9±0.16ms        ? ?/sec    1.05       3.0±0.20ms        ? ?/sec    1.17       3.4±0.39ms        ? ?/sec
   aggregate_query_group_by_u64 15 12                   1.00       2.8±0.09ms        ? ?/sec    1.04       3.0±0.28ms        ? ?/sec    1.07       3.0±0.34ms        ? ?/sec    1.12       3.2±0.28ms        ? ?/sec
   aggregate_query_group_by_with_filter                 1.02       2.1±0.09ms        ? ?/sec    1.02       2.1±0.08ms        ? ?/sec    1.00       2.0±0.06ms        ? ?/sec    1.02       2.1±0.16ms        ? ?/sec
   aggregate_query_group_by_with_filter_u64 15 12       1.02       2.0±0.09ms        ? ?/sec    1.03       2.0±0.09ms        ? ?/sec    1.04       2.0±0.14ms        ? ?/sec    1.00   1973.5±90.65µs        ? ?/sec
   aggregate_query_no_group_by 15 12                    1.04   1201.4±42.15µs        ? ?/sec    1.00   1152.4±25.40µs        ? ?/sec    1.03   1190.9±51.39µs        ? ?/sec    1.10  1268.5±258.16µs        ? ?/sec
   aggregate_query_no_group_by_count_distinct_narrow    1.01       5.5±0.23ms        ? ?/sec    1.02       5.5±0.24ms        ? ?/sec    1.00       5.4±0.35ms        ? ?/sec    1.13       6.1±0.61ms        ? ?/sec
   aggregate_query_no_group_by_count_distinct_wide      1.00       7.5±0.44ms        ? ?/sec    1.01       7.6±0.36ms        ? ?/sec    1.03       7.8±0.62ms        ? ?/sec    1.00       7.5±0.35ms        ? ?/sec
   aggregate_query_no_group_by_min_max_f64              1.02   1191.2±61.07µs        ? ?/sec    1.00   1171.9±92.30µs        ? ?/sec    1.09  1279.7±145.46µs        ? ?/sec    1.01   1180.7±89.39µs        ? ?/sec
   ```
   
   The good news is that we didn't slow down; the not-so-good news is that it 
isn't much faster either, though this may be related to the specific benchmarks 
used. I will see if I can run the TPC-H benchmarks and do some profiling of 
where time is being spent.
   
   Also in the "good news" category: I ran out of RAM trying to find 
hash collisions -- lol. I have an idea to fix the hashing function for the test, 
but the current `create_hashes` function seems pretty good at avoiding 
collisions on u64 values.
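
   As an illustration only (this is not the code in the PR, and it does not use 
`create_hashes`), one way a test could force collisions is to make the hash 
function pluggable and swap in a deliberately bad one. The names 
`group_counts` and `colliding_hash` below are hypothetical, and the sketch 
assumes plain `u64` keys rather than Arrow arrays:

```rust
use std::collections::HashMap;

/// Count occurrences of each key, bucketing by a caller-supplied hash.
/// Keys that share a hash land in the same bucket and are told apart
/// by comparing the key values themselves.
fn group_counts(keys: &[u64], hash: impl Fn(u64) -> u64) -> Vec<(u64, usize)> {
    // hash value -> (key, count) entries that share that hash
    let mut buckets: HashMap<u64, Vec<(u64, usize)>> = HashMap::new();
    for &key in keys {
        let bucket = buckets.entry(hash(key)).or_default();
        // Look for an existing group with this exact key inside the bucket.
        let pos = bucket.iter().position(|&(k, _)| k == key);
        match pos {
            Some(i) => bucket[i].1 += 1,
            None => bucket.push((key, 1)),
        }
    }
    let mut out: Vec<(u64, usize)> = buckets.into_values().flatten().collect();
    out.sort_unstable();
    out
}

/// Hypothetical test-only hash: every key maps to one of two values,
/// so collisions are guaranteed without needing a huge input.
fn colliding_hash(key: u64) -> u64 {
    key % 2
}

fn main() {
    let keys = [1, 3, 5, 3, 2, 1, 3];
    let counts = group_counts(&keys, colliding_hash);
    // Keys 1, 3 and 5 all hash to 1, yet they must remain separate groups.
    assert_eq!(counts, vec![(1, 2), (2, 1), (3, 3), (5, 1)]);
    println!("{:?}", counts);
}
```

   With a hash like this, a handful of rows is enough to exercise the 
collision path, so no large input is needed.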
   
   
   Benchmarks run like this:
   ```
   cargo bench --bench aggregate_query_sql -- --save-baseline gby_new
   ```
   


