[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

GitBox Sat, 07 Aug 2021 06:11:30 -0700


Dandandan edited a comment on issue #790:
URL: 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-894652913



   My results on the latest version.
   
   ```
   q1 took 36 ms
   q2 took 358 ms
   q3 took 998 ms
   q4 took 50 ms
   q5 took 983 ms
   q7 took 911 ms
   q10 took 4075 ms
   ```
   
   q4 is improved in the latest version compared to earlier (it used an int32 
column to group on). q2 still looks a bit (~10%) slower.
   
   The query is: `SELECT id1, id2, SUM(v1) AS v1 FROM tbl GROUP BY id1, id2` 
(id1, id2 are utf8, v1 is an int32)
   
   I am wondering whether this comment 
https://github.com/apache/arrow-datafusion/pull/808/files#r683975473 might help 
a bit, as it does some additional cloning of `ScalarValue`s.
   
   Another cause could be that hashing and comparing one `Vec<u8>` might be 
faster than hashing two single strings and combining them afterwards (however, I 
would expect the extra copying / rehashing to be worse than the single cost of 
hashing itself).
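
To make the two alternatives above concrete, here is a minimal sketch of both key strategies for a two-column utf8 group key. It is not DataFusion's actual implementation: the function names, the length-prefixed byte layout, the use of `DefaultHasher`, and the hash-combining formula are all assumptions chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Strategy A (hypothetical): copy every group value into one byte buffer
/// and hash that buffer once. Each value is length-prefixed so that
/// ("a", "bc") and ("ab", "c") produce different keys.
fn hash_packed_key(id1: &str, id2: &str) -> (Vec<u8>, u64) {
    let mut key = Vec::with_capacity(id1.len() + id2.len() + 16);
    for s in [id1, id2].iter() {
        key.extend_from_slice(&(s.len() as u64).to_le_bytes());
        key.extend_from_slice(s.as_bytes());
    }
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (key, hasher.finish())
}

/// Hash a single string value on its own.
fn hash_one(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Mix a new column hash into a running hash (a boost-style combiner,
/// picked here only as an example).
fn combine(seed: u64, h: u64) -> u64 {
    seed ^ h
        .wrapping_add(0x9e37_79b9_7f4a_7c15)
        .wrapping_add(seed << 6)
        .wrapping_add(seed >> 2)
}

/// Strategy B (hypothetical): hash each column value separately and
/// combine the per-column hashes afterwards, without copying the strings.
fn hash_columns(id1: &str, id2: &str) -> u64 {
    combine(hash_one(id1), hash_one(id2))
}
```

Strategy A pays for an extra copy of each value but then hashes and compares a single contiguous buffer. Strategy B avoids the copy but runs the hasher once per column and has to compare the group values column by column when hashes collide. Which of the two is faster for q2 would need to be measured.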


