alamb commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-894173779
Thanks @Dandandan -- I think that some of the time spent creating/dropping/comparing `ScalarValue` will go away when I complete the optimized implementation for all types in `ScalarValue::eq_array` -- as of now, grouping on any type other than Utf8, UInt64, F32 or F64 will take the slow path.

I also did some profiling with the tpch q1 (which has a group by on two keys and no joins). My conclusion from that exercise is that this approach is about the same speed as the one on master.

Profiling command:

```shell
cargo run --release --bin tpch -- benchmark datafusion --iterations 10 --path ./data --format tbl --query 1 --batch-size 10000
```

On master, `Query 1 avg time: 2874.42 ms`

On the gby_null_new branch, `Query 1 avg time: 2904.64 ms`

Which is well within the error bounds of my measurement setup.

My profiling suggests that Q1 spends 85% of the time parsing CSV data and approximately 15% of the time doing the aggregation. Of that 15%, 10% is in `create_hashes` and 3% is looking up in the hash table.

My next steps are going to be:
1. Create a benchmark program that is not IO bound, with string / string dictionary grouping keys, for which I think this approach will be most beneficial
2. Complete the optimized implementation of `ScalarValue::eq_array`, after which perhaps you can rerun your benchmarks and we can see if they get better

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
