leerho commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-469866645
 
 
   This is very helpful and I think I see where we are talking past each other. 
 There appear to be several different use cases for the concept of "TopN".  The 
starting point for these use cases is after the selection and other possible 
SQL manipulation, where you end up with a set of distributed lists of items 
where the task is to efficiently merge together these lists and obtain the 
"TopN" of these items.  It is the next operations that follow that distinguish 
the several use cases. 
   
   -  Case 1:  The items are `Comparable` with respect to some property `y`, 
which does not have to be numeric.  It could be a string.   If `y` is numeric, 
it could be either positive or negative, it doesn't matter. The other 
significant property is that these items would not be aggregated together, e.g. 
it would not be meaningful to do an operation like `ItemC = ItemA + ItemB`.  To 
finish, the following operations would be performed:  (Assume N = 100).  Note 
that the "order" here could be either direction so that one could also obtain 
the "BottomN".
   -- Order the items in each node . 
   -- Select top 100 from the list in each node (Note: we don't need more than 
N from each node)
   -- Merge the lists together in the Broker
   -- Order the items in the Broker, choose the top 100.
   
   - Case 2: The items have the `Identity` property such that one can check if 
`ItemA == ItemB`, and they have a `weight` (or importance) property that is 
independent of the `identity` property.  In other words, `ItemA == ItemB and 
possibly ItemA.w1 != ItemB.w2`.   This weight is numeric, positive and 
obviously comparable.  These objects can be "merged" together when the identity 
is satisfied as in `ItemC.w3 =  ItemA.w1 + ItemB.w2` where `w3 = w1 + w2`.    
(Assume finalN = 100).  The current algorithm would be the following
   -- Order the items in each node . 
   -- Select top (nodeN=1000 from the list in each node (Note: we must select 
`nodeN >> finalN`)
   -- Merge the lists together in the Broker
   -- Order the items in the Broker, merging identical items, choose the top 
100.
   The sketch equivalent of this process would be
   -- Feed the items in each node into a FIS.
   -- Each node sends its sketch to the Broker.
   -- The Broker merges the sketches together into a single sketch
   -- The result "Most frequent Items" or "Heaviest Items" are returned from 
the sketch.
   
   - Case 3: This is the same as Case 2, except that negative weight values 
would be allowed. This, unfortunately cannot be solved with our current sketch 
anyway.   We will have to think about this.
   
   I think this addresses all the use cases (this was kinda slapped together).  
But for both cases 1 and 2, improvements can be made in how the final steps are 
performed. For case 3 I would like to ask: "Is this a real case and how often 
does it occur?
   
   
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to