peferron edited a comment on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-473508823 @leerho > The **first belief** that: > > > The current topN algorithm **can be used for any query** shaped like SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when useApproximateTopN is enabled. > > This belief is a total fiction and this claim will fail miserably. You can use `topN` with any aggregation and get back results. This is a fact, not a belief. The misguided belief would be that these results are always meaningful, but I don't recall anybody here saying that; it has been repeated again and again that `topN` is data-sensitive. > The **second belief**: > > This claim of the ability to plug-in HLL sketches as the aggregator will also fail miserably. Same answer as above. You can absolutely plug a HLL sketch as a `topN` aggregation. Whether the results are meaningful or not depends on the data. > The **third** belief: > > [...] > > In fact the **only** "aggregation" operation that can possibly produce any meaningful results with the Druid TopN algorithm is a sum of positive weights. Since this assertion is worded very strongly, it can be disproven by finding a counter-example where meaningful results are produced using `topN` in conjunction with, say, a HLL sketch. That's easy to do for almost anyone with access to a Druid cluster. That happens to be my case, so I just ran queries against about a billion rows of real-world data. The queries are `SELECT x, COUNT(DISTINCT y) GROUP BY 1 ORDER BY 2 DESC LIMIT 5`. In the first test, `x` is a dimension with a cardinality of 30k, and `y` is another dimension with high cardinality. `topN` returns the same results as `groupBy` (which is the exact counterpart of `topN`), but 30% faster. In the second test, `x` is a dimension with a cardinality of 2M. `groupBy` cannot return results because it exceeds resource limits. After reducing the query interval enough to bring the cardinality of `x` down to levels that `groupBy` can handle, `topN` again returns the same results as `groupBy`, but 75% faster. So, here we are—meaningful results produced using `topN` with HLL, and with significant performance gains. These are the first queries I tried, but I'm sure I could find some where `topN` fails miserably. That's not the point, though—the point is that `topN` works well in some cases, as demonstrated. You seem to think that Druid users, including participants in this discussion, are incapable of identifying these cases, and are using `topN` only because they don't understand how badly it can fail. That's certainly your prerogative, and you may be right for the possibly large number of Druid users who don't dig into how the system works or read the documentation. As a somewhat power user though, I'm happy to have a tool like `topN` at my disposal, even if it's up to me to use it responsibly. That being said, I certainly hope that Druid goes towards a direction where error bounds are systematically provided, as discussed in #7160. So thank you and @AlexanderSaydakov for starting this movement by integrating the DS library.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
