peferron commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-470386898
 
 
   @leerho As mentioned earlier, from my understanding of the topN code, there 
is no n log n sorting of all items in the segment (a node can contain multiple 
segments, which contain usually up to 5M rows each and can be processed in 
parallel; per-segment results are merged locally on the node, then the result 
is passed to the broker for another round of merging with other node results). 
The process is:
   
   1. Aggregate values of all n items in the segment, grouping by item key. 
Time O(n) and space O(m) where m is the number of distinct item keys. Worst 
case is m = n so this can actually take a lot of space compared to FIS if 
cardinality is high.
   2. Take top k items by aggregated value. This is currently done with a 
priority queue, so time O(m log k) and space O(k). Time could be improved to 
O(m) with quickselect but it might not make a difference in practice.
   
   Regarding the HLL Map sketch, I was indeed referring to the one in the Yahoo 
Datasketches lib (should have made this clearer, sorry). If a topN includes a 
HLL aggregation, the current implementation keeps a dedicated HLL sketch for 
each distinct item key, so you can end up with millions of low-cardinality HLL 
sketches. It looks like the HLL Map sketch is designed to improve this 
scenario. But it probably would be quite a lot of work, as a naïve 
implementation that duplicates the set of item keys would waste most of the 
space gained.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to