ahmed-mez commented on PR #19562: URL: https://github.com/apache/datafusion/pull/19562#issuecomment-3728479697
> > This is the comparison output and it's not great 😞 At first glance mainly topk queries are impacted by the perf regression > > I suspect the queries that are most impacted are ones with high group counts (high cardinality grouping). > > I think we would need to actually change the storage to be chunked rather than just the emission part Thanks for the insight @alamb , I did some profiling to validate the hypothesis: I focused on ClickBench Q32 (partitioned dataset) which is ~12x slower on the current branch [profiles comparison (main vs current branch)](https://share.firefox.dev/4qoYS8Y) The hotspots in this PR trace back to multiple `emit` → `take_n`, with the cost distributed across allocating new vectors and copying null bitmaps Example profile (zoomed in on one of multiple `GroupedHashAggregateStream::emit`) <img width="1906" height="542" alt="image" src="https://github.com/user-attachments/assets/2d34fb7e-f9bd-4b36-8e94-e15a7e0b0ec4" /> This confirms that with high-cardinality groups and frequent emissions, we're doing massive memory operations repeatedly. That said, it looks like chunked storage is needed and it'd be great to agree on a path forward. This is a larger change touching all `GroupColumn` implementations. Should we create an epic to track all this (incremental emission + chunked storage)? Happy to help scope it out. Thanks for your help and patience guiding me through this investigation! 🙏 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
