leerho commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-469866645 This is very helpful and I think I see where we are talking past each other. There appear to be several different use cases for the concept of "TopN". The starting point for these use cases is after the selection and other possible SQL manipulation, where you end up with a set of distributed lists of items where the task is to efficiently merge together these lists and obtain the "TopN" of these items. It is the next operations that follow that distinguish the several use cases. - Case 1: The items are `Comparable` with respect to some property `y`, which does not have to be numeric. It could be a string. If `y` is numeric, it could be either positive or negative, it doesn't matter. The other significant property is that these items would not be aggregated together, e.g. it would not be meaningful to do an operation like `ItemC = ItemA + ItemB`. To finish, the following operations would be performed: (Assume N = 100). Note that the "order" here could be either direction so that one could also obtain the "BottomN". -- Order the items in each node . -- Select top 100 from the list in each node (Note: we don't need more than N from each node) -- Merge the lists together in the Broker -- Order the items in the Broker, choose the top 100. - Case 2: The items have the `Identity` property such that one can check if `ItemA == ItemB`, and they have a `weight` (or importance) property that is independent of the `identity` property. In other words, `ItemA == ItemB and possibly ItemA.w1 != ItemB.w2`. This weight is numeric, positive and obviously comparable. These objects can be "merged" together when the identity is satisfied as in `ItemC.w3 = ItemA.w1 + ItemB.w2` where `w3 = w1 + w2`. (Assume finalN = 100). The current algorithm would be the following -- Order the items in each node . -- Select top (nodeN=1000 from the list in each node (Note: we must select `nodeN >> finalN`) -- Merge the lists together in the Broker -- Order the items in the Broker, merging identical items, choose the top 100. The sketch equivalent of this process would be -- Feed the items in each node into a FIS. -- Each node sends its sketch to the Broker. -- The Broker merges the sketches together into a single sketch -- The result "Most frequent Items" or "Heaviest Items" are returned from the sketch. - Case 3: This is the same as Case 2, except that negative weight values would be allowed. This, unfortunately cannot be solved with our current sketch anyway. We will have to think about this. I think this addresses all the use cases (this was kinda slapped together). But for both cases 1 and 2, improvements can be made in how the final steps are performed. For case 3 I would like to ask: "Is this a real case and how often does it occur?
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
