Dandandan commented on issue #418: URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851518881
Thanks for the input @jhorstmann - that adds some support for the idea! A speedup of roughly 2x for the more challenging queries where DF currently "struggles" (or more for extreme cases such as 1 value per group) is about what I would expect from making the change in DataFusion as well, based on some profiling results, in particular for high cardinality queries.

It might be an idea to count the number of distinct values inside a batch (that should be doable, as we are already inserting the offsets into the HashMap) and use that count to decide whether to do a `take` + `batch_update` for low cardinality, or to combine everything in one loop for high cardinality updates?

Arrow currently lacks a mutable data structure that can update values at given offsets (rather than only append values) - what are you using for that, or did you build your own structure?
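
To make the idea of choosing a strategy per batch a bit more concrete, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion or Arrow code: `Strategy`, `plan_batch`, `sum_batch`, and the `max_ratio` threshold are hypothetical names invented for illustration, group keys are simplified to `u64`, and the only aggregate is a sum. The gather step merely stands in for a `take`, and the bulk addition stands in for a `batch_update`.

```rust
use std::collections::HashMap;

/// Hypothetical strategy names for this sketch; not DataFusion types.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// Few distinct groups: gather each group's rows, then update in bulk.
    TakeThenBatchUpdate,
    /// Many distinct groups: update accumulators row by row in one loop.
    SingleLoop,
}

/// Builds a key -> row offsets map for one batch and picks a strategy
/// from the ratio of distinct keys to rows. `max_ratio` is an assumed
/// tuning knob, not a measured value.
fn plan_batch(keys: &[u64], max_ratio: f64) -> (HashMap<u64, Vec<usize>>, Strategy) {
    let mut offsets: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, key) in keys.iter().enumerate() {
        offsets.entry(*key).or_default().push(row);
    }
    // The distinct count comes for free from the map we already built.
    let ratio = offsets.len() as f64 / keys.len().max(1) as f64;
    let strategy = if ratio <= max_ratio {
        Strategy::TakeThenBatchUpdate
    } else {
        Strategy::SingleLoop
    };
    (offsets, strategy)
}

/// Adds one batch into the running per-group sums and reports which
/// strategy was used.
fn sum_batch(
    keys: &[u64],
    values: &[i64],
    sums: &mut HashMap<u64, i64>,
    max_ratio: f64,
) -> Strategy {
    let (offsets, strategy) = plan_batch(keys, max_ratio);
    match strategy {
        Strategy::TakeThenBatchUpdate => {
            for (key, rows) in &offsets {
                // Stand-in for `take`: gather this group's values.
                let taken: Vec<i64> = rows.iter().map(|&r| values[r]).collect();
                // Stand-in for `batch_update`: one bulk update per group.
                *sums.entry(*key).or_insert(0) += taken.iter().sum::<i64>();
            }
        }
        Strategy::SingleLoop => {
            // Mutate the accumulator state in place, one row at a time.
            for (key, value) in keys.iter().zip(values) {
                *sums.entry(*key).or_insert(0) += *value;
            }
        }
    }
    strategy
}
```

With a threshold of `0.5`, a batch with keys `[1, 1, 2, 1]` has 2 distinct keys over 4 rows and takes the bulk path, while a batch with keys `[3, 4, 5, 6]` has one row per group and takes the single loop. A real implementation would need a mutable, offset-addressable accumulator state in place of the `HashMap<u64, i64>` used here, which is exactly the missing piece raised above.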
