wjhypo opened a new issue #11096:
URL: https://github.com/apache/druid/issues/11096


   ### Affected Version
   
   Druid master branch.
   
   ### Description
   
   For an example query like: `select sum(ct) as cnt from x where id = 123 group by dim1, dim2 order by cnt desc limit 5`

   Normally, all candidate rows (potentially millions) are retrieved on each historical node and transferred back to the broker, which then performs a global sort and takes the top 5. If forced limit push down is enabled, only the top 5 results from each historical node are taken and transferred back to the broker to merge, which is much faster. When this mechanism is used, the results are usually approximate. However, if at ingestion time we make sure that rows of the same group always go to the same segment, then we should get accurate results along with a free performance gain. Even without that specific partitioning, local top N should still yield results with acceptable accuracy, but in production we found the accuracy to be lower than expected.
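   The approximation described above can be sketched with a small simulation. This is illustrative only (not Druid code); the node contents and group names are hypothetical, chosen so that one group is split evenly across two historicals:

```python
# Illustrative simulation: why limit push down is approximate when a
# group's rows are spread across historical nodes. Group "f" is split
# across both hypothetical nodes below.
from collections import Counter

node1 = Counter({"a": 100, "b": 90, "c": 80, "d": 70, "e": 60, "f": 55})
node2 = Counter({"g": 100, "h": 90, "i": 80, "j": 70, "k": 60, "f": 55})

def top_n(counts, n):
    return dict(counts.most_common(n))

# Exact plan: ship everything to the broker, merge, then take the top 5.
exact = top_n(node1 + node2, 5)

# Push-down plan: each historical keeps only its local top 5; the broker
# merges those partial results and takes the top 5 of the merge.
pushed = top_n(Counter(top_n(node1, 5)) + Counter(top_n(node2, 5)), 5)

print(exact)   # "f" wins globally with 55 + 55 = 110
print(pushed)  # "f" is dropped: it was only 6th on each node
```

   If segments were partitioned so that all rows of group "f" landed on one node, its full count would survive the local top 5 and the pushed-down result would match the exact one, which is the accuracy argument made above.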
   
   Taking the above query as an example: to compute the local top N on historicals, we need to compare rows by the value of the aggregator used in the ORDER BY clause, `cnt`. However, the current code appears to have a bug in the comparison logic: it does not use the correct offset of the aggregator being ordered on, which seriously hurts accuracy.
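   To make the offset point concrete, here is a minimal sketch of the ordering step, assuming a flattened row layout (grouping dimensions first, then aggregator slots at fixed offsets). This is not the actual Druid comparator; the layout, slot indices, and values are all hypothetical:

```python
# Illustrative sketch of ordering result rows on a historical. Each row
# is a flat buffer: grouping dims first, then aggregator values at fixed
# offsets. For "order by cnt desc", the comparator must read the value
# at the offset of the ORDER BY aggregator (hypothetically slot 1 here),
# not some other aggregator's slot.
NUM_DIMS = 2            # dim1, dim2
ORDER_BY_AGG_INDEX = 1  # "cnt" is the second aggregator in this layout

rows = [
    # dim1, dim2, some_other_agg, cnt
    ["x", "p", 1, 10],
    ["y", "q", 9, 30],
    ["z", "r", 5, 20],
]

def order_by_value(row):
    # Correct: skip past the dimensions, then index the ORDER BY slot.
    # Reading row[NUM_DIMS + 0] instead would rank by the wrong metric.
    return row[NUM_DIMS + ORDER_BY_AGG_INDEX]

top = sorted(rows, key=order_by_value, reverse=True)[:2]
print([r[:NUM_DIMS] for r in top])  # groups ranked by cnt
```

   With the wrong offset, the rows would be ranked by `some_other_agg` instead, so each historical's local top N could keep the wrong groups entirely, compounding the approximation error described above.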
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


