[GitHub] [incubator-druid] leerho commented on issue #7187: Improve topN algorithm

GitBox Fri, 15 Mar 2019 19:33:31 -0700

leerho commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-473491408
 
 
   @leventov @gianm @peferron @Dylan1312 
   
   I removed two comments I had here, because after thinking about this some 
more, I am confident that I can make even stronger statements to make my point 
clear.
   
   I the "good old days" of SQL if you issued a query shaped like `SELECT x, 
AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N`, the DataBase would oblige by 
performing the operations brute-force on all the data, and would eventually 
return with the result set of N items that you could call a "TopN", but as we 
will discover that is a misnomer.  The bottom of that list could have some 
items of equal weight that could represent noise in the sense that they may not 
have a weight that is greater than the weight of all remaining items not on the 
list.  Nonetheless, this would be considered the "exact" solution. 
   
   However, in the attempts to speed up the horribly slow brute-force SQL query 
above, and based on what was described above in this thread, the Druid "TopN" 
algorithm was created to do this (I think this is right):
   
   M Historical Nodes:
   1. "First Truncation": All intra-segment resultsets (which are exact) are 
sorted (using a priority queue) and truncated to max(k, 1000). Call that _K_.
   2. "Pre-Aggregation": Then the historical node merges all those per-segment 
resultsets pairwise by first combining the 2K elements based on the aggregation 
key, then sorting those 2K elements (again, using a priority queue) into a new 
priority queue of size _K_.
   3. Each H-Node passes a sorted resultset of size max(k, 1000) to the broker.
   
   Broker Node:
   
   4. 2nd Truncation and "Post-Aggregation": The data from all M H-Nodes is 
gathered, and sorted and truncated again using a Priority Queue with a size 
max(k, 1000).
   5. K items are reported as the result.
   
   **These truncations and aggregations change everything!**
   
   The **first belief** that:
   > The current topN algorithm **can be used for any query** shaped like 
SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when 
useApproximateTopN is enabled.
   
   This belief is a total fiction and this claim will fail miserably. 
   
   The **second belief**:
   > Going back to your song example, could FIS support getting the top song 
titles by number of unique listeners over the past week? **The current topN can 
do that using HLL sketches as metric** (weight). That's one example of the 
usefulness of accepting any Druid aggregation or post-aggregation as a metric.
   
   This claim of the ability to plug-in HLL sketches as the aggregator will 
also fail miserably.  
   
   The **third** belief:
   > It sounds like FIS is **limited** in a couple ways that prevents it from 
being used for all queries of that form: (because)
   > **"AGG" must be "SUM" and y must all be positive**. (topN will accept any 
aggregation function and any y)
   
   The view here is that somehow FIS is "limited", and is a "subset case" 
because it can only `sum` positive values is seriously misguided.  
   
   In fact the **only** "aggregation" operation that can possibly produce any 
meaningful results with the Druid TopN algorithm is a sum of positive weights. 
And this  case that kinda-sorta works is when the data in the H-nodes is 
pre-aggregated (only sum of positive weights allowed) prior to the 1st 
truncation (Priority Queue). Then the PQs are merged, summed and PQed again, 
before passing to the Broker.  The process in the Broker would be similar.  But 
this still could have significant errors.
   
   The **forth** belief:
   > "LIMIT" (using FIS) will not necessarily be adhered to. (topN will always 
return LIMIT results, or the number of distinct values of x, whichever is 
lower).  
   
   This is also misguided.  For a set of items where all have the same weight, 
there is no TopN!  Any random selection of N values from the set could be a 
"valid" result, which makes the name "TopN" meaningless.  In order for a set of 
values to have a "Top", there must be some skew in their weights.
   
   Suppose the set has 1000 items with weight 1, and 2 items of weight 10.  The 
top-100 will have the two heavy items, the other 98 could be randomly chosen 
from the remainder of the original set .  Those 98 add no insight as they are 
essentially noise.  So to require that there be N items returned from a TopN 
query is severely misguided.  The Druid TopN also doesn't tell you which one of 
the small end of the TopN are noise, making it even less useful.  The better 
term to use is either "Frequent Items" or "Heavy Hitters", both of which are 
used in the scientific literature.
   
   ----
   In conclusion:
   
   1. The belief that a query of the shape `SELECT x, AGG(y) FROM table ORDER 
BY AGG(y) DESC LIMIT N` can actually produce meaningful results using the Druid 
TopN algorithm is misguided.  It is especially misguided if the item weights 
are negative  or those aggregations are non-linear functions such as an HLL 
sketch.  
   
   2. The only query that might kinda-sorta-sometimes produce meaningful 
results is when the weights are positive and the aggregations are `sum`.  But 
if this is the case, why not use the FIS, which operates with positive weights 
and the "aggregation" is a `sum`.   And it will produce far superior results.
   
   In other words the FIS is not a subset of what can be usefully queried, it 
is the ONLY query set that can produce meaningful results.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-druid] leerho commented on issue #7187: Improve topN algorithm

Reply via email to