peferron commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-470386898 @leerho As mentioned earlier, from my understanding of the topN code, there is no n log n sorting of all items in the segment (a node can contain multiple segments, which contain usually up to 5M rows each and can be processed in parallel; per-segment results are merged locally on the node, then the result is passed to the broker for another round of merging with other node results). The process is: 1. Aggregate values of all n items in the segment, grouping by item key. Time O(n) and space O(m) where m is the number of distinct item keys. Worst case is m = n so this can actually take a lot of space compared to FIS if cardinality is high. 2. Take top k items by aggregated value. This is currently done with a priority queue, so time O(m log k) and space O(k). Time could be improved to O(m) with quickselect but it might not make a difference in practice. Regarding the HLL Map sketch, I was indeed referring to the one in the Yahoo Datasketches lib (should have made this clearer, sorry). If a topN includes a HLL aggregation, the current implementation keeps a dedicated HLL sketch for each distinct item key, so you can end up with millions of low-cardinality HLL sketches. It looks like the HLL Map sketch is designed to improve this scenario. But it probably would be quite a lot of work, as a naïve implementation that duplicates the set of item keys would waste most of the space gained.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
