wjhypo opened a new issue #11096: URL: https://github.com/apache/druid/issues/11096
### Affected Version

Druid master branch.

### Description

For an example query like:

`select sum(ct) as cnt from x where id = 123 group by dim1, dim2 order by cnt desc limit 5`

Normally, all candidate rows (potentially millions) are retrieved on each historical node and transferred back to the broker, which then does a global sort and takes the top 5. If forced limit push-down is enabled, only the top 5 results from each historical node are taken and transferred back to the broker for merging, which is much faster.

When this mechanism is used, the results are usually approximate. However, if at ingestion time we ensure that rows of the same group always go to the same segment, then we should get accurate results along with a free performance gain.

Even without that specific partitioning, local top N should still produce results with acceptable accuracy, but in production we found the accuracy to be lower than expected. Taking the above query as an example, computing the local top N on historicals requires looking at the value of the aggregator used in the order by clause, `cnt`. However, the current code appears to have a bug in the comparison logic: it does not use the correct offset of the aggregators, which seriously hurts accuracy.
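To make the approximation concrete, here is a minimal sketch (not Druid's actual code; all class and method names here are hypothetical) of why per-node top N followed by a broker merge can miss the true top group: a group whose rows are split across historicals may fail to make any node's local cut, even though its global sum would win.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical illustration of limit push-down: each "historical" holds
// pre-aggregated (group, cnt) rows, returns only its local top N, and the
// broker re-aggregates the partial rows and takes the top N again.
public class LimitPushDownSketch {
    // Local top N on one node, ordered by cnt descending (the "order by cnt desc limit n").
    static List<Map.Entry<String, Long>> topN(Map<String, Long> rows, int n) {
        return rows.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    // Broker-side merge: sum partial counts per group, then take top N again.
    static List<Map.Entry<String, Long>> brokerMerge(
            List<List<Map.Entry<String, Long>>> partials, int n) {
        Map<String, Long> merged = new HashMap<>();
        for (List<Map.Entry<String, Long>> p : partials) {
            for (Map.Entry<String, Long> e : p) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return topN(merged, n);
    }
}
```

For example, with node 1 holding `{a: 3, b: 5}` and node 2 holding `{a: 3, c: 4}` and `limit 1`, each node's local top 1 drops group `a`, so the broker returns `b` (5) even though `a` globally sums to 6. If rows of group `a` had all been ingested into the same segment, the push-down result would have been exact.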
