alamb commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-894513932
TLDR; I did some more testing with a synthetic best-case harness, and I can see a small (8%) improvement for grouping on utf8 key data (average of 22 characters each). While not as much as I had hoped, it is good enough for me to polish this up and get it ready for code review. Ok, I am out of time for the day but I will continue tomorrow.

Details: I whipped up a custom harness that is entirely CPU bound (it feeds the same record batch in multiple times). Roughly speaking, it tests grouping on 1 billion rows with 100 distinct keys, calculating `COUNT()` and `AVG(f64)` aggregates. I currently have single-column group keys of each of these types: int64, utf8, and dict.

master:

```
Completed query in 6.674159987s
Completed query in 7.769550938s
Completed query in 7.467556697s
Completed query in 7.452186844s
Completed query in 7.379664866s
---------------
Completed 5 iterations query in 36.743119332s
136079899.88061365 rows/sec
```

on gby_null_new / https://github.com/apache/arrow-datafusion/pull/808:

```
Completed query in 6.438788872s
Completed query in 6.55647331s
Completed query in 6.716707907s
Completed query in 6.931733327s
Completed query in 6.68060826s
---------------
Completed 5 iterations query in 33.324311676s
150040608.44866526 rows/sec
```

This gives an idea of what is being tested:

```
Benchmarking select utf8_key, count(*), avg(f64) from t group by utf8_key
100000 batches of 10000 rows = 1000000000 total rows

explain select utf8_key, count(*), avg(f64) from t group by utf8_key
+---------------+---------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                          |
+---------------+---------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #t.utf8_key, #COUNT(UInt8(1)), #AVG(t.f64)                                                       |
|               |   Aggregate: groupBy=[[#t.utf8_key]], aggr=[[COUNT(UInt8(1)), AVG(#t.f64)]]                                  |
|               |     TableScan: t projection=Some([1, 3])                                                                     |
| physical_plan | ProjectionExec: expr=[utf8_key@0 as utf8_key, COUNT(UInt8(1))@1 as COUNT(UInt8(1)), AVG(t.f64)@2 as AVG(f64)] |
|               |   HashAggregateExec: mode=FinalPartitioned, gby=[utf8_key@0 as utf8_key], aggr=[COUNT(UInt8(1)), AVG(f64)]   |
|               |     CoalesceBatchesExec: target_batch_size=4096                                                              |
|               |       RepartitionExec: partitioning=Hash([Column { name: "utf8_key", index: 0 }], 16)                        |
|               |         HashAggregateExec: mode=Partial, gby=[utf8_key@0 as utf8_key], aggr=[COUNT(UInt8(1)), AVG(f64)]      |
|               |           RepartitionExec: partitioning=RoundRobinBatch(16)                                                  |
|               |             RepeatExec repeat=100000                                                                         |
+---------------+---------------------------------------------------------------------------------------------------------------+
```
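For anyone who wants to approximate this workload without the custom `RepeatExec` operator (which only exists on the branch above), here is a minimal sketch built from public APIs. This is an assumption-laden illustration, not the code from #808: it targets a recent DataFusion release (`SessionContext`; the comment above predates that API), and the `make_batch` helper plus the `key_...` format are hypothetical, chosen only so that each utf8 key is 22 characters.

```rust
// Hypothetical sketch of the benchmark workload (not the actual harness
// from the linked branch, which uses a custom RepeatExec operator).
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Build one batch with `distinct_keys` distinct utf8 keys.
/// "key_" plus 18 zero-padded digits gives the 22-character keys
/// mentioned above.
fn make_batch(rows: usize, distinct_keys: usize) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![
        Field::new("utf8_key", DataType::Utf8, false),
        Field::new("f64", DataType::Float64, false),
    ]));
    let keys: StringArray = (0..rows)
        .map(|i| Some(format!("key_{:018}", i % distinct_keys)))
        .collect();
    let vals: Float64Array = (0..rows).map(|i| Some(i as f64)).collect();
    RecordBatch::try_new(
        schema,
        vec![Arc::new(keys) as ArrayRef, Arc::new(vals) as ArrayRef],
    )
    .unwrap()
}

#[tokio::main]
async fn main() -> Result<()> {
    // 100_000 batches of 10_000 rows = 1_000_000_000 total rows.
    let batch = make_batch(10_000, 100);

    // Cloning a RecordBatch only bumps reference counts on the underlying
    // Arrow buffers, so these "copies" share one allocation and the query
    // stays CPU bound -- similar in spirit to RepeatExec.
    let batches = vec![batch.clone(); 100_000];
    let table = MemTable::try_new(batch.schema(), vec![batches])?;

    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(table))?;

    ctx.sql("select utf8_key, count(*), avg(f64) from t group by utf8_key")
        .await?
        .show()
        .await?;
    Ok(())
}
```

Since the numbers above come from the `RepeatExec`-based harness, this sketch will not reproduce them exactly; it only mirrors the shape of the data and query.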
