[GitHub] [incubator-druid] leerho commented on issue #7187: Improve topN algorithm

GitBox Wed, 06 Mar 2019 19:06:58 -0800

leerho commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-470366530
 
 
   > getting the top song titles by number of unique listeners over the past 
week
   
   @peferron Now that is very interesting!  The current FIS doesn't handle this 
case, but I think it could be extended to do that, or to have a configurable 
aggregate function.  I'll have to think about it.  
   
   I agree with you that having the ability to specify any Druid aggregation or 
post-aggregation as a metric is pretty powerful and I'm not trying to replace 
that functionality.  
   
   My principal concern is the way the final steps are performed to obtain the 
Top 100(for example) items:  
   - for each node sort all items in the node // cost: n log n //where n = 
total items in the node
   - pass top 1000 items to the Broker //network cost
   - The Broker sorts (numNodes(M) * 1000 items): cost: M * 1000 log (M*1000)
   - The Broker chooses the top 100.
   
   This heuristic approach is error prone, data sensitive and compute-wise 
costly.  This, I think, can be improved, but not by substituting FIS!
   
   I am not familiar with how the pre- and post-aggregation steps intertwine 
themselves with this process.  But perhaps this heuristic could be accomplished 
with a Min-Heap?  It would be faster, smaller, not data sensitive, and 
generally more robust.  It would change the compute cost from a `n log n` to a 
`n log k`, which could be significant.
   
   This (I realize now) has nothing to do with FIS.  As I tried to state above, 
that Druid's concept of "TopN" consists of several very different types of 
operations.  And the current FIS would only help in a subset of them.)
   
   But where FIS could apply (many duplicates (with weights) in the stream and 
you wish to return the heaviest weighted items (or the most frequent items).  
Now the the compute cost reduces to just `n` and makes it suitable for 
real-time queries.  This type of query is fairly common and it might deserve a 
special query type of its own. 
   
   The idea of a compound FrequentItem<Item, Aggregator> sketch sounds very 
interesting, but we don't have it now, so it would have to be developed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-druid] leerho commented on issue #7187: Improve topN algorithm

Reply via email to