alamb commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-894513932
TLDR; I did some more testing with a synthetic best-case harness, and I can see a small (8%) improvement for grouping on utf8 key data (average of 22 characters each). While not as much as I had hoped, it is good enough for me to polish this up and get it ready for code review. Ok, I am out of time for the day but I will continue tomorrow.

Details: I whipped up a custom harness that is entirely CPU bound (it feeds the same record batch in multiple times). Roughly speaking, it tests grouping on 1 billion rows with 100 distinct keys, calculating `COUNT()` and `AVG(f64)` aggregates. I currently have single-column group keys of each of these types: int64, utf8, and dict.

master:

```
Completed query in 6.674159987s
Completed query in 7.769550938s
Completed query in 7.467556697s
Completed query in 7.452186844s
Completed query in 7.379664866s
---------------
Completed 5 iterations query in 36.743119332s
136079899.88061365 rows/sec
```

on gby_null_new / https://github.com/apache/arrow-datafusion/pull/808:

```
Completed query in 6.438788872s
Completed query in 6.55647331s
Completed query in 6.716707907s
Completed query in 6.931733327s
Completed query in 6.68060826s
---------------
Completed 5 iterations query in 33.324311676s
150040608.44866526 rows/sec
```

This gives an idea of what is being tested:

```
Benchmarking select utf8_key, count(*), avg(f64) from t group by utf8_key
100000 batches of 10000 rows = 1000000000 total rows

explain select utf8_key, count(*), avg(f64) from t group by utf8_key
+---------------+---------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                          |
+---------------+---------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #t.utf8_key, #COUNT(UInt8(1)), #AVG(t.f64)                                                       |
|               |   Aggregate: groupBy=[[#t.utf8_key]], aggr=[[COUNT(UInt8(1)), AVG(#t.f64)]]                                  |
|               |     TableScan: t projection=Some([1, 3])                                                                     |
| physical_plan | ProjectionExec: expr=[utf8_key@0 as utf8_key, COUNT(UInt8(1))@1 as COUNT(UInt8(1)), AVG(t.f64)@2 as AVG(f64)] |
|               |   HashAggregateExec: mode=FinalPartitioned, gby=[utf8_key@0 as utf8_key], aggr=[COUNT(UInt8(1)), AVG(f64)]   |
|               |     CoalesceBatchesExec: target_batch_size=4096                                                              |
|               |       RepartitionExec: partitioning=Hash([Column { name: "utf8_key", index: 0 }], 16)                        |
|               |         HashAggregateExec: mode=Partial, gby=[utf8_key@0 as utf8_key], aggr=[COUNT(UInt8(1)), AVG(f64)]      |
|               |           RepartitionExec: partitioning=RoundRobinBatch(16)                                                  |
|               |             RepeatExec repeat=100000                                                                         |
+---------------+---------------------------------------------------------------------------------------------------------------+
```
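For anyone who wants to approximate this workload without the custom `RepeatExec` operator (which only exists on the branch above), here is a minimal sketch built from public APIs. This is an assumption-laden illustration, not the code from #808: it targets a recent DataFusion release (`SessionContext`; the comment above predates that API), and the `make_batch` helper plus the `key_...` format are hypothetical, chosen only so that each utf8 key is 22 characters.

```rust
// Hypothetical sketch of the benchmark workload (not the actual harness
// from the linked branch, which uses a custom RepeatExec operator).
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Build one batch with `distinct_keys` distinct utf8 keys.
/// "key_" plus 18 zero-padded digits gives the 22-character keys
/// mentioned above.
fn make_batch(rows: usize, distinct_keys: usize) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![
        Field::new("utf8_key", DataType::Utf8, false),
        Field::new("f64", DataType::Float64, false),
    ]));
    let keys: StringArray = (0..rows)
        .map(|i| Some(format!("key_{:018}", i % distinct_keys)))
        .collect();
    let vals: Float64Array = (0..rows).map(|i| Some(i as f64)).collect();
    RecordBatch::try_new(
        schema,
        vec![Arc::new(keys) as ArrayRef, Arc::new(vals) as ArrayRef],
    )
    .unwrap()
}

#[tokio::main]
async fn main() -> Result<()> {
    // 100_000 batches of 10_000 rows = 1_000_000_000 total rows.
    let batch = make_batch(10_000, 100);

    // Cloning a RecordBatch only bumps reference counts on the underlying
    // Arrow buffers, so these "copies" share one allocation and the query
    // stays CPU bound -- similar in spirit to RepeatExec.
    let batches = vec![batch.clone(); 100_000];
    let table = MemTable::try_new(batch.schema(), vec![batches])?;

    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(table))?;

    ctx.sql("select utf8_key, count(*), avg(f64) from t group by utf8_key")
        .await?
        .show()
        .await?;
    Ok(())
}
```

Since the numbers above come from the `RepeatExec`-based harness, this sketch will not reproduce them exactly; it only mirrors the shape of the data and query.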
