alamb commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-893778745
I got enough of the approach described by @Dandandan in https://github.com/apache/arrow-datafusion/issues/790#issuecomment-893232614 working in https://github.com/apache/arrow-datafusion/pull/808/files to take some benchmarks. The results are as follows:

```
(arrow_dev) alamb@MacBook-Pro:~/Software/arrow-datafusion$ critcmp gby_new gby_new2 master1 master2
group                                              gby_new                          gby_new2                         master1                          master2
-----                                              -------                          --------                         -------                          -------
aggregate_query_group_by                           1.01  2.9±0.18ms      ? ?/sec    1.00  2.9±0.16ms      ? ?/sec    1.05  3.0±0.20ms      ? ?/sec    1.17  3.4±0.39ms      ? ?/sec
aggregate_query_group_by_u64 15 12                 1.00  2.8±0.09ms      ? ?/sec    1.04  3.0±0.28ms      ? ?/sec    1.07  3.0±0.34ms      ? ?/sec    1.12  3.2±0.28ms      ? ?/sec
aggregate_query_group_by_with_filter               1.02  2.1±0.09ms      ? ?/sec    1.02  2.1±0.08ms      ? ?/sec    1.00  2.0±0.06ms      ? ?/sec    1.02  2.1±0.16ms      ? ?/sec
aggregate_query_group_by_with_filter_u64 15 12     1.02  2.0±0.09ms      ? ?/sec    1.03  2.0±0.09ms      ? ?/sec    1.04  2.0±0.14ms      ? ?/sec    1.00  1973.5±90.65µs  ? ?/sec
aggregate_query_no_group_by 15 12                  1.04  1201.4±42.15µs  ? ?/sec    1.00  1152.4±25.40µs  ? ?/sec    1.03  1190.9±51.39µs  ? ?/sec    1.10  1268.5±258.16µs ? ?/sec
aggregate_query_no_group_by_count_distinct_narrow  1.01  5.5±0.23ms      ? ?/sec    1.02  5.5±0.24ms      ? ?/sec    1.00  5.4±0.35ms      ? ?/sec    1.13  6.1±0.61ms      ? ?/sec
aggregate_query_no_group_by_count_distinct_wide    1.00  7.5±0.44ms      ? ?/sec    1.01  7.6±0.36ms      ? ?/sec    1.03  7.8±0.62ms      ? ?/sec    1.00  7.5±0.35ms      ? ?/sec
aggregate_query_no_group_by_min_max_f64            1.02  1191.2±61.07µs  ? ?/sec    1.00  1171.9±92.30µs  ? ?/sec    1.09  1279.7±145.46µs ? ?/sec    1.01  1180.7±89.39µs  ? ?/sec
```

The good news is that we didn't slow down; the not-so-good news is that it isn't much faster either, though this may be related to the specific benches used. I will see if I can run the TPC-H benchmarks and profile where the time is being spent.

Also in the "good news" category: I ran out of RAM trying to find hash collisions -- lol.
I have an idea to fix the hashing function for the test, but the current `create_hashes` function seems pretty good at avoiding collisions on u64 values.

Benchmarks were run like this:

```
cargo bench --bench aggregate_query_sql -- --save-baseline gby_new
```
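
For readers following along, here is a rough sketch of why a deliberately bad hash function is a handy way to test collision handling without running out of RAM. It is not the code from the PR and does not call `create_hashes`; `Grouper`, `GroupState`, `count_groups`, and `default_hash` are hypothetical names made up for illustration. It assumes a design where groups are looked up by a precomputed `u64` hash and the stored key values are compared to tell colliding keys apart.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Per-group state: the original key values plus a running row count.
struct GroupState {
    key: Vec<u64>,
    count: usize,
}

/// Hypothetical grouper: maps a precomputed hash to the indices of every
/// group that shares that hash. Colliding keys are distinguished by
/// comparing the stored key values, never by the hash alone.
struct Grouper<H: Fn(&[u64]) -> u64> {
    hasher: H,
    map: HashMap<u64, Vec<usize>>,
    groups: Vec<GroupState>,
}

impl<H: Fn(&[u64]) -> u64> Grouper<H> {
    fn new(hasher: H) -> Self {
        Grouper { hasher, map: HashMap::new(), groups: Vec::new() }
    }

    fn insert(&mut self, key: &[u64]) {
        let hash = (self.hasher)(key);
        let groups = &mut self.groups;
        let candidates = self.map.entry(hash).or_default();
        // Check every group with the same hash for an exact key match.
        for &idx in candidates.iter() {
            if groups[idx].key.as_slice() == key {
                groups[idx].count += 1;
                return;
            }
        }
        // No match: either a new hash or a genuine collision. Start a new group.
        groups.push(GroupState { key: key.to_vec(), count: 1 });
        candidates.push(groups.len() - 1);
    }
}

/// A reasonable hash function built on the standard library hasher.
fn default_hash(key: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Groups `keys` and returns (key, count) pairs in first-seen order.
fn count_groups<H: Fn(&[u64]) -> u64>(keys: &[Vec<u64>], hasher: H) -> Vec<(Vec<u64>, usize)> {
    let mut grouper = Grouper::new(hasher);
    for key in keys {
        grouper.insert(key);
    }
    grouper.groups.into_iter().map(|g| (g.key, g.count)).collect()
}

fn main() {
    let keys = vec![vec![1, 2], vec![3, 4], vec![1, 2], vec![5, 6]];
    // A constant hash forces every key to collide, exercising the equality check.
    let collided = count_groups(&keys, |_| 0);
    let normal = count_groups(&keys, default_hash);
    assert_eq!(collided, normal);
    println!("{:?}", collided);
}
```

Swapping in a constant hash makes every key collide, so the equality check runs on every insert; if the grouping is correct, the result should match what a good hash produces.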
