[GitHub] [arrow-datafusion] alamb commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

GitBox Fri, 06 Aug 2021 03:44:18 -0700


alamb commented on issue #790:
URL: 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-894173779



   Thanks @Dandandan -- I think that some of the time spent 
creating/dropping/comparing `ScalarValue` will go away when I complete the 
optimized implementation for all types in `ScalarValue::eq_array` -- as of now, 
grouping on any type other than Utf8, UInt64, F32 or F64 will take the slow 
path. 
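
To illustrate the fast path / slow path distinction, here is a minimal sketch. The `Scalar` and `Column` types and the `eq_column` method are hypothetical stand-ins, not DataFusion's `ScalarValue` or Arrow arrays. The assumption is that the slow path materializes a temporary scalar for every comparison, while the fast path compares in place:

```rust
/// Simplified stand-in for a dynamically typed scalar
/// (hypothetical; not DataFusion's `ScalarValue`).
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    UInt64(Option<u64>),
    Utf8(Option<String>),
    Int32(Option<i32>),
}

/// Simplified stand-in for a typed, nullable column (hypothetical).
enum Column {
    UInt64(Vec<Option<u64>>),
    Utf8(Vec<Option<String>>),
    Int32(Vec<Option<i32>>),
}

impl Column {
    /// Slow-path helper: materialize row `index` as an owned `Scalar`.
    /// For strings this allocates a copy of the value.
    fn scalar_at(&self, index: usize) -> Scalar {
        match self {
            Column::UInt64(v) => Scalar::UInt64(v[index]),
            Column::Utf8(v) => Scalar::Utf8(v[index].clone()),
            Column::Int32(v) => Scalar::Int32(v[index]),
        }
    }
}

impl Scalar {
    /// Returns true if `self` equals the value stored at `column[index]`.
    /// Two nulls (`None`) compare equal, so null keys can form a group.
    fn eq_column(&self, column: &Column, index: usize) -> bool {
        match (self, column) {
            // Fast paths: compare in place, with no temporary value.
            (Scalar::UInt64(s), Column::UInt64(v)) => *s == v[index],
            (Scalar::Utf8(s), Column::Utf8(v)) => s.as_deref() == v[index].as_deref(),
            // Slow path: create a temporary `Scalar`, compare, then drop it.
            _ => *self == column.scalar_at(index),
        }
    }
}
```

In this sketch, `Int32` deliberately has no dedicated arm, so it falls through to the slow path; adding an arm per type is what removes the create/drop cost.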
   
   I also did some profiling with the tpch q1 (which has a group by on 
two keys and no joins). My conclusion from that exercise is that this approach 
is about the same speed as the one on master.
   
   Profiling command:
   ```shell
   cargo run --release --bin tpch -- benchmark datafusion --iterations 10 --path ./data --format tbl --query 1 --batch-size 10000
   ```
   On master, `Query 1 avg time: 2874.42 ms`
   
   On the gby_null_new branch `Query 1 avg time: 2904.64 ms`
   
   That difference (about 30 ms, or roughly 1%) is well within the error bounds of my measurement setup.
   
   My profiling suggests that Q1 spends 85% of the time parsing CSV data and 
approximately 15% of the time doing the aggregation. Of that 15%, 10% is in 
`create_hashes` and 3% is looking up in the hash table.
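
For readers unfamiliar with the two phases named above, the following is a rough sketch of hashing rows first and then looking them up in a hash table. It is not DataFusion's code: `create_hashes_sketch` and `assign_groups` are hypothetical names, the key columns are assumed to be nullable strings, and the standard library's `DefaultHasher` stands in for whatever hasher the real implementation uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Phase 1: compute one hash per row, folding in one key column at a time.
/// (Hypothetical helper; the name only echoes `create_hashes`.)
fn create_hashes_sketch(columns: &[Vec<Option<String>>], num_rows: usize) -> Vec<u64> {
    let mut hashes = vec![0u64; num_rows];
    for column in columns {
        for (row, value) in column.iter().enumerate() {
            let mut hasher = DefaultHasher::new();
            // Combine the hash accumulated from earlier columns...
            hashes[row].hash(&mut hasher);
            // ...with this column's value. `None` hashes like any other
            // value, so null keys receive a stable hash too.
            value.hash(&mut hasher);
            hashes[row] = hasher.finish();
        }
    }
    hashes
}

/// Phase 2: look each row's hash up in a table and assign a group index.
/// Returns the group index of every row and the number of groups.
fn assign_groups(columns: &[Vec<Option<String>>], hashes: &[u64]) -> (Vec<usize>, usize) {
    // hash -> candidate groups, each stored as (group index, representative row)
    let mut table: HashMap<u64, Vec<(usize, usize)>> = HashMap::new();
    let mut group_of_row = Vec::with_capacity(hashes.len());
    let mut num_groups = 0;
    for (row, hash) in hashes.iter().enumerate() {
        let candidates = table.entry(*hash).or_default();
        // Equal hashes do not prove equal keys, so compare the key values.
        let found = candidates
            .iter()
            .find(|(_, rep)| columns.iter().all(|c| c[*rep] == c[row]))
            .map(|(group, _)| *group);
        let group = match found {
            Some(group) => group,
            None => {
                // First time this key combination is seen: start a new group.
                candidates.push((num_groups, row));
                num_groups += 1;
                num_groups - 1
            }
        };
        group_of_row.push(group);
    }
    (group_of_row, num_groups)
}
```

The sketch assumes every column has exactly `num_rows` entries. Timing the two functions separately is one way to reproduce a split like the one reported above, though the proportions will depend on the data and hasher.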
   
   My next steps are going to be:
   1. Create a benchmark program that is not IO bound, with string / string 
dictionary grouping keys, for which I think this approach will be most beneficial.
   2. Complete the optimized implementation of `ScalarValue::eq_array`, after 
which perhaps you can rerun your benchmarks and we can see if they get better.
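
As a rough idea of what a benchmark that is not IO bound could look like, the sketch below generates string keys in memory and times only the grouping. It is an assumption-laden illustration, not the planned benchmark: `make_keys`, `group_count` and `average_time` are hypothetical names, it uses the standard library's `HashMap` rather than DataFusion, and it covers plain string keys only (no dictionary encoding).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Builds `num_rows` string keys in memory, cycling through `num_groups`
/// distinct values, so no file parsing is involved. `num_groups` must be > 0.
fn make_keys(num_rows: usize, num_groups: usize) -> Vec<String> {
    (0..num_rows)
        .map(|i| format!("key_{}", i % num_groups))
        .collect()
}

/// Counts the rows that fall into each string key.
fn group_count(keys: &[String]) -> HashMap<&str, u64> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for key in keys {
        *counts.entry(key.as_str()).or_insert(0) += 1;
    }
    counts
}

/// Runs the grouping `iterations` times (must be > 0) and returns the
/// average wall-clock time per run.
fn average_time(keys: &[String], iterations: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // `black_box` discourages the optimizer from discarding the result.
        std::hint::black_box(group_count(keys));
    }
    start.elapsed() / iterations
}
```

Because the keys are built before the timer starts, the measured time reflects hashing and table lookups rather than CSV parsing; for example, `average_time(&make_keys(1_000_000, 100), 10)` would average ten runs over one million rows.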
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
