Dandandan commented on PR #15985: URL: https://github.com/apache/datafusion/pull/15985#issuecomment-2863220782
This gets a small performance boost on ClickBench query 9 (~9% on my end). I am actually wondering if we can go further. I think we could store something like `HashSet<(T::Native, usize)>` (unique value + group id) instead of `Vec<HashSet<T::Native>>` (one hashset per group) and delay counting the values until the end, by iterating over all the entries (instead of calling `.len()` on each per-group set). The "obvious" advantage is that we avoid creating *many* hashsets in high-cardinality cases, which hurts both performance and memory usage. However, it seems kind of tricky to integrate into the current `GroupsAccumulator` setup 🤔
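
A minimal standalone sketch of the two layouts, to make the comparison concrete. It is not DataFusion code: the function names are hypothetical, `i64` stands in for `T::Native`, and it does not touch the `GroupsAccumulator` trait, which is the integration part that still looks tricky.

```rust
use std::collections::HashSet;

/// Current layout (simplified): one `HashSet` per group.
/// The distinct count for a group is just `.len()` of its set.
/// Hypothetical helper, with `i64` standing in for `T::Native`.
fn count_distinct_set_per_group(
    values: &[i64],
    group_ids: &[usize],
    num_groups: usize,
) -> Vec<usize> {
    let mut sets: Vec<HashSet<i64>> = vec![HashSet::new(); num_groups];
    for (&value, &group) in values.iter().zip(group_ids) {
        sets[group].insert(value);
    }
    sets.iter().map(|set| set.len()).collect()
}

/// Proposed layout: a single `HashSet` keyed by (value, group id).
/// Counting is deferred to the end: every entry in the set is one
/// distinct value for its group, so we tally entries per group.
fn count_distinct_single_set(
    values: &[i64],
    group_ids: &[usize],
    num_groups: usize,
) -> Vec<usize> {
    let mut seen: HashSet<(i64, usize)> = HashSet::new();
    for (&value, &group) in values.iter().zip(group_ids) {
        seen.insert((value, group));
    }
    let mut counts = vec![0usize; num_groups];
    for &(_, group) in &seen {
        counts[group] += 1;
    }
    counts
}

fn main() {
    // Group 0 sees {1, 2}, group 1 sees {3, 1}, group 2 sees {1}.
    let values = [1, 2, 2, 3, 1, 1];
    let group_ids = [0, 0, 0, 1, 1, 2];

    let per_group = count_distinct_set_per_group(&values, &group_ids, 3);
    let single = count_distinct_single_set(&values, &group_ids, 3);

    assert_eq!(per_group, single);
    println!("{:?}", single); // prints [2, 2, 1]
}
```

With many groups, the second version allocates one hash table instead of one per group, at the cost of a final pass over all entries to produce the counts.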