leerho commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-473491408 @leventov @gianm @peferron @Dylan1312 I removed two comments I had here, because after thinking about this some more, I am confident that I can make even stronger statements to make my point clear. I the "good old days" of SQL if you issued a query shaped like `SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N`, the DataBase would oblige by performing the operations brute-force on all the data, and would eventually return with the result set of N items that you could call a "TopN", but as we will discover that is a misnomer. The bottom of that list could have some items of equal weight that could represent noise in the sense that they may not have a weight that is greater than the weight of all remaining items not on the list. Nonetheless, this would be considered the "exact" solution. However, in the attempts to speed up the horribly slow brute-force SQL query above, and based on what was described above in this thread, the Druid "TopN" algorithm was created to do this (I think this is right): M Historical Nodes: 1. "First Truncation": All intra-segment resultsets (which are exact) are sorted (using a priority queue) and truncated to max(k, 1000). Call that _K_. 2. "Pre-Aggregation": Then the historical node merges all those per-segment resultsets pairwise by first combining the 2K elements based on the aggregation key, then sorting those 2K elements (again, using a priority queue) into a new priority queue of size _K_. 3. Each H-Node passes a sorted resultset of size max(k, 1000) to the broker. Broker Node: 4. 2nd Truncation and "Post-Aggregation": The data from all M H-Nodes is gathered, and sorted and truncated again using a Priority Queue with a size max(k, 1000). 5. K items are reported as the result. **These truncations and aggregations change everything!** The **first belief** that: > The current topN algorithm **can be used for any query** shaped like SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when useApproximateTopN is enabled. This belief is a total fiction and this claim will fail miserably. The **second belief**: > Going back to your song example, could FIS support getting the top song titles by number of unique listeners over the past week? **The current topN can do that using HLL sketches as metric** (weight). That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric. This claim of the ability to plug-in HLL sketches as the aggregator will also fail miserably. The **third** belief: > It sounds like FIS is **limited** in a couple ways that prevents it from being used for all queries of that form: (because) > **"AGG" must be "SUM" and y must all be positive**. (topN will accept any aggregation function and any y) The view here is that somehow FIS is "limited", and is a "subset case" because it can only `sum` positive values is seriously misguided. In fact the **only** "aggregation" operation that can possibly produce any meaningful results with the Druid TopN algorithm is a sum of positive weights. And this case that kinda-sorta works is when the data in the H-nodes is pre-aggregated (only sum of positive weights allowed) prior to the 1st truncation (Priority Queue). Then the PQs are merged, summed and PQed again, before passing to the Broker. The process in the Broker would be similar. But this still could have significant errors. The **forth** belief: > "LIMIT" (using FIS) will not necessarily be adhered to. (topN will always return LIMIT results, or the number of distinct values of x, whichever is lower). This is also misguided. For a set of items where all have the same weight, there is no TopN! Any random selection of N values from the set could be a "valid" result, which makes the name "TopN" meaningless. In order for a set of values to have a "Top", there must be some skew in their weights. Suppose the set has 1000 items with weight 1, and 2 items of weight 10. The top-100 will have the two heavy items, the other 98 could be randomly chosen from the remainder of the original set . Those 98 add no insight as they are essentially noise. So to require that there be N items returned from a TopN query is severely misguided. The Druid TopN also doesn't tell you which one of the small end of the TopN are noise, making it even less useful. The better term to use is either "Frequent Items" or "Heavy Hitters", both of which are used in the scientific literature. ---- In conclusion: 1. The belief that a query of the shape `SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N` can actually produce meaningful results using the Druid TopN algorithm is misguided. It is especially misguided if the item weights are negative or those aggregations are non-linear functions such as an HLL sketch. 2. The only query that might kinda-sorta-sometimes produce meaningful results is when the weights are positive and the aggregations are `sum`. But if this is the case, why not use the FIS, which operates with positive weights and the "aggregation" is a `sum`. And it will produce far superior results. In other words the FIS is not a subset of what can be usefully queried, it is the ONLY query set that can produce meaningful results.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
