leerho commented on issue #7187: Improve topN algorithm
URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-472185432

Perhaps it is "well-known" amongst the Druid developers, but I think the Druid team is taking a serious risk with Druid's reputation. Druid's customers may not fully understand that the current TopN functionality may return garbage, even if it is discussed in the documentation (who reads the manual, anyway :) ).

Quoting @peferron above:

> Going back to your song example, could FIS support getting the top song titles by number of unique listeners over the past week? The **current topN can do that** using HLL sketches as metric (weight). That's one example of the usefulness of accepting any Druid aggregation or post-aggregation as a metric.

It is not clear from his statement whether he realizes that "accepting any Druid aggregation or post-aggregation" may not work as intended and, in fact may produce meaningless or misleading results. Data sensitivity is not a good thing especially given that there is no easy way for a customer to determine whether the data he/she has processed is producing trustworthy results or not. I would consider the current TopN functionality to be broken and unreliable, period.

Here are two alternatives (there may be others):

- In any "TopN" operation do not allow any aggregation steps after the data set has been "limited". This has to be true both on the nodes and on the broker.
- Adopt the Frequent Items sketch for "TopN" operations. Accept that the ordering weight metric is positive integers (longs) and that the aggregation operation is `add()`. (If you feel you really need real-valued weights, then we can develop for you an FIS based on doubles. That is not a big deal.)

My point is that a better, more efficient, and free solution exists for the most common TopN type operations. Why not use it?
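
To illustrate the first alternative, here is a toy Python sketch of why aggregating after a per-node limit can go wrong. It is not Druid's implementation: the two "nodes", the item names, the helper functions `local_top` and `merge_and_rank`, and the per-node limit of 2 are all hypothetical, chosen only to make the effect visible on tiny data.

```python
from collections import Counter


def local_top(counts, limit):
    """Keep only the `limit` heaviest items on one node (the lossy step)."""
    return dict(Counter(counts).most_common(limit))


def merge_and_rank(partials, n):
    """Broker step: sum the partial results, then take the top n."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds weights per item
    return merged.most_common(n)


# Hypothetical per-node weights. Item "z" is third on each node,
# but it is the heaviest item overall (8 + 8 = 16).
node_a = {"x": 10, "y": 9, "z": 8}
node_b = {"w": 10, "v": 9, "z": 8}

# Exact answer: aggregate everything first, limit last.
exact = merge_and_rank([node_a, node_b], 1)

# Limit on each node first, aggregate afterwards.
approx = merge_and_rank([local_top(node_a, 2), local_top(node_b, 2)], 1)

print("exact top-1: ", exact)
print("approx top-1:", approx)
```

In this toy data the exact top item is `z`, but the limit-then-aggregate path never sees `z` at all, and nothing in its output signals that the result is wrong. Whether real data triggers this depends on how the weights are distributed across nodes, which is the data sensitivity described above.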
