himanshug opened a new issue #8413: Improve GroupBy query execution to push limit to segment scan phase
URL: https://github.com/apache/incubator-druid/issues/8413

### Motivation

GroupBy query execution can push limits down to queryable nodes for merging, which exploits a small limit to cut down on data transfer and processing. Currently, however, the limit is pushed down only as far as the merge phase on historicals. The segment scan phase does not exploit the limit, which slows down a groupBy query when the segments contain many unique rows and the query uses many complex aggregators.

This proposal is to optionally push the limit all the way down to the segment scan during GroupBy query processing.

### Proposed changes

I have tried a prototype on a 0.15.0 build, which is currently running in my Druid cluster. The following changes (combined with #8412) show tremendous improvements for some groupBy queries on large datasets with complex aggregators and an aggressive limit.

- `GroupByQueryEngineV2.HashAggregateIterator.newGrouper()` is updated to return `LimitedBufferHashGrouper` when `query.isApplyLimitPushDown() == true`.
- `GroupByQueryEngineV2.GroupByEngineKeySerde` is updated to correctly implement the `Grouper.BufferComparator bufferComparator()` and `Grouper.BufferComparator bufferComparatorWithAggregators(..)` methods.
- A new method `Grouper.BufferComparator bufferComparator(int keyBufferPosition, @Nullable StringComparator stringComparator)` is added to `GroupByColumnSelectorStrategy`.

Also, to be safe, I am planning to add a query context flag `enableLimitPushdownToSegment` to enable this optimization.

### Test plan (optional)

There are existing tests with queries that push down limits. I will update them to also run with `enableLimitPushdownToSegment=true`.
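
### Illustrative sketch

To make the idea behind the proposal concrete, here is a minimal, self-contained sketch of why applying the limit during the scan saves work. It is not Druid's `LimitedBufferHashGrouper`, which operates on byte buffers through `Grouper.BufferComparator`. The class `LimitPushDownSketch`, the `Row` type, and the method `groupWithLimit` are hypothetical names invented for this example, and the sketch assumes the query orders results by the grouping key alone (not by an aggregator), so a key that sorts after the current top `limit` keys can never re-enter the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: hypothetical names, not part of Druid.
final class LimitPushDownSketch {

    /** One scanned row: a grouping key and a value to sum (hypothetical shape). */
    static final class Row {
        final String key;
        final long value;

        Row(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Groups rows by key and sums their values, retaining at most `limit`
     * groups: the ones whose keys sort first in natural String order.
     * Assumes ordering is on the grouping key only.
     */
    static TreeMap<String, Long> groupWithLimit(List<Row> rows, int limit) {
        TreeMap<String, Long> groups = new TreeMap<>();
        if (limit <= 0) {
            return groups;
        }
        for (Row row : rows) {
            if (groups.containsKey(row.key)) {
                // Key is already retained: keep aggregating into it.
                groups.merge(row.key, row.value, Long::sum);
            } else if (groups.size() < limit) {
                // Still room under the limit: start a new group.
                groups.put(row.key, row.value);
            } else if (row.key.compareTo(groups.lastKey()) < 0) {
                // New key sorts before the current last key: evict the last
                // group, which can no longer be part of the top `limit`.
                groups.pollLastEntry();
                groups.put(row.key, row.value);
            }
            // Otherwise the key sorts after every retained group, so the row
            // is skipped without allocating or updating any aggregator.
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("d", 1));
        rows.add(new Row("b", 2));
        rows.add(new Row("a", 3));
        rows.add(new Row("d", 4));
        rows.add(new Row("b", 5));
        rows.add(new Row("c", 6));
        // Prints {a=3, b=7}
        System.out.println(groupWithLimit(rows, 2));
    }
}
```

In this sketch the memory used during the scan is bounded by `limit` groups rather than by the number of unique keys in the segment, and rows for keys outside the limit never touch an aggregator. That is the kind of saving the proposal targets when aggregators are expensive.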
