[GitHub] [arrow-datafusion] Dandandan commented on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

GitBox Mon, 31 May 2021 07:13:38 -0700


Dandandan commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851518881



   Thanks for the input @jhorstmann - that adds some support for the idea!
A speedup of roughly 2x for the more challenging queries where DataFusion
currently "struggles" (or more for extreme cases such as 1 value per group) is
about what I would expect from making the same change in DataFusion, based on
some profiling results, particularly for high-cardinality queries. One idea
would be to count the number of distinct values inside a batch (that should be
doable, as we are already inserting the offsets into the HashMap) and use that
to decide whether to do a `take` + `batch_update` for low cardinality or to
combine everything in one loop for high-cardinality updates.
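
   A rough sketch of that per-batch decision, using only the standard library
rather than Arrow arrays. Everything here is an assumption for illustration:
`UpdateStrategy`, `group_offsets`, `choose_strategy`, `sum_by_key` and the
`max_ratio` threshold are hypothetical names, not existing DataFusion APIs, and
the gather step only stands in for `take` + `batch_update`.

```rust
use std::collections::HashMap;

/// Strategy for updating aggregate state for one batch (hypothetical).
#[derive(Debug, PartialEq)]
enum UpdateStrategy {
    /// Few groups: gather each group's rows, then update once per group
    /// (stands in for `take` + `batch_update`).
    PerGroupBatch,
    /// Many groups: update the state row by row in a single loop.
    RowByRow,
}

/// Maps each distinct key to the row offsets where it occurs in the batch.
fn group_offsets(keys: &[u64]) -> HashMap<u64, Vec<usize>> {
    let mut groups: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, key) in keys.iter().enumerate() {
        groups.entry(*key).or_default().push(row);
    }
    groups
}

/// Picks a strategy from the ratio of distinct keys to rows.
/// `max_ratio` is an assumed tuning knob, not a DataFusion setting.
fn choose_strategy(num_groups: usize, num_rows: usize, max_ratio: f64) -> UpdateStrategy {
    if num_rows == 0 || (num_groups as f64) / (num_rows as f64) <= max_ratio {
        UpdateStrategy::PerGroupBatch
    } else {
        UpdateStrategy::RowByRow
    }
}

/// Sums `values` per key using whichever strategy was chosen for the batch.
fn sum_by_key(keys: &[u64], values: &[i64], max_ratio: f64) -> HashMap<u64, i64> {
    // The offsets are collected up front, so the distinct count is free.
    let groups = group_offsets(keys);
    let mut sums: HashMap<u64, i64> = HashMap::new();
    match choose_strategy(groups.len(), keys.len(), max_ratio) {
        UpdateStrategy::PerGroupBatch => {
            for (key, offsets) in &groups {
                // Gather the group's values, then aggregate them in one call.
                let gathered: Vec<i64> = offsets.iter().map(|&i| values[i]).collect();
                sums.insert(*key, gathered.iter().sum());
            }
        }
        UpdateStrategy::RowByRow => {
            for (key, value) in keys.iter().zip(values) {
                *sums.entry(*key).or_insert(0) += *value;
            }
        }
    }
    sums
}
```

   Both branches produce the same sums; only the amount of per-group overhead
differs, which is why the threshold would need to be chosen from profiling.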
   
   Arrow currently lacks a mutable data structure that can update values at
given offsets (rather than only append values) - what are you using for that,
or did you build your own structure?
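
   To make the question concrete, here is one possible shape for such a
structure: plain vectors indexed by group, updated in place and converted to
nullable values only at the end. This is a sketch under assumptions, not what
Arrow or @jhorstmann's code does; `SumState` and its methods are hypothetical.

```rust
/// Growable sum state addressed by group index (hypothetical type).
struct SumState {
    /// Running sum per group.
    sums: Vec<i64>,
    /// Whether the group has seen at least one non-null value.
    valid: Vec<bool>,
}

impl SumState {
    fn new() -> Self {
        SumState { sums: Vec::new(), valid: Vec::new() }
    }

    /// Adds `value` to the slot at `group_index`, growing the state if needed.
    /// A `None` value still creates the slot but leaves it null.
    fn update(&mut self, group_index: usize, value: Option<i64>) {
        if group_index >= self.sums.len() {
            self.sums.resize(group_index + 1, 0);
            self.valid.resize(group_index + 1, false);
        }
        if let Some(v) = value {
            self.sums[group_index] += v;
            self.valid[group_index] = true;
        }
    }

    /// Produces one nullable result per group, in group-index order.
    fn finish(&self) -> Vec<Option<i64>> {
        self.sums
            .iter()
            .zip(&self.valid)
            .map(|(sum, ok)| if *ok { Some(*sum) } else { None })
            .collect()
    }
}
```

   Calling `update(0, Some(5))`, `update(2, Some(7))`, `update(0, Some(1))` and
`update(1, None)` in sequence leaves group 0 at 6, group 1 null and group 2 at 7.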


