leerho commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-470366530 > getting the top song titles by number of unique listeners over the past week @peferron Now that is very interesting! The current FIS doesn't handle this case, but I think it could be extended to do that, or to have a configurable aggregate function. I'll have to think about it. I agree with you that having the ability to specify any Druid aggregation or post-aggregation as a metric is pretty powerful and I'm not trying to replace that functionality. My principal concern is the way the final steps are performed to obtain the Top 100(for example) items: - for each node sort all items in the node // cost: n log n //where n = total items in the node - pass top 1000 items to the Broker //network cost - The Broker sorts (numNodes(M) * 1000 items): cost: M * 1000 log (M*1000) - The Broker chooses the top 100. This heuristic approach is error prone, data sensitive and compute-wise costly. This, I think, can be improved, but not by substituting FIS! I am not familiar with how the pre- and post-aggregation steps intertwine themselves with this process. But perhaps this heuristic could be accomplished with a Min-Heap? It would be faster, smaller, not data sensitive, and generally more robust. It would change the compute cost from a `n log n` to a `n log k`, which could be significant. This (I realize now) has nothing to do with FIS. As I tried to state above, that Druid's concept of "TopN" consists of several very different types of operations. And the current FIS would only help in a subset of them.) But where FIS could apply (many duplicates (with weights) in the stream and you wish to return the heaviest weighted items (or the most frequent items). Now the the compute cost reduces to just `n` and makes it suitable for real-time queries. This type of query is fairly common and it might deserve a special query type of its own. The idea of a compound FrequentItem<Item, Aggregator> sketch sounds very interesting, but we don't have it now, so it would have to be developed.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
