leerho commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-475121316

@peferron Thank you for your response and for performing a test with an HLL aggregator. The intent of my statements in the preceding entries is to protect the end users of Druid, and ultimately to protect the reputation of Druid as a trustworthy platform.

In my statements above, my use of the word "meaningful" is with respect to an end user and is meant to imply that the results of **any** query of the above shape could be interpreted usefully (i.e., interpreted correctly and similar to what a brute-force exact computation would produce). My use of the word "belief" is again with respect to the end user: that he or she might believe (because the Druid specifications say it can be done) that the **results** of **any** query of the shape stated above would be meaningful (useful and correct). Of course the query can be executed; that is beside the point, and doesn't require much trust or belief.

Unfortunately, running one test does not prove that an algorithm or process works correctly. Nor does running a hundred or a thousand such tests. However, all one has to do to prove that an algorithm or process does *not* work is to provide one counterexample test that fails.

Relying on the fact that the documentation states that the TopN process is _data sensitive_ is a very weak argument, as most end users probably don't read the documentation, or don't think about what that phrase might mean. The fact that a data-sensitive query might run 30% or 75% faster is a misleading "sales pitch" to the end users, because in order for an end user to actually trust that the results from such a query are usefully correct, the end user would have to run each query in brute-force, exact mode and check its results against the results from the "faster" query.
The total time to do both queries is clearly longer than just running the brute-force exact query by itself. If you think this is ridiculous, it follows directly from the "data sensitive" statement: you cannot prove that a "data sensitive" process works on all data, because you can't possibly have access to all data. More importantly, you don't have access to the end user's data! The statement "data sensitive" is a huge admission that the process may not work, and without running brute-force checks on every run, the user can never know whether the results are trustworthy or not.

It appears to be your opinion that just documenting the process with the words "data sensitive" is enough, and that if the user wishes to shoot themselves in the foot, that is OK. I don't share that view, and perhaps nothing I say will change your mind; nonetheless, there are a couple of thoughts that you might consider.

1. Allowing positive and negative values as weights in the aggregations. This actually can work if there is no limiting or truncation of the data followed by aggregation. However, with truncation followed by aggregation, positive and negative weights can easily cancel across segments or nodes within the limited region, nullifying any earlier TopN results from individual segments or nodes. This, I hope you can see, would easily lead to nonsensical results.

2. Allowing non-linear processes, such as unique counting (HLL), as an aggregation option can also lead to very strange results, both within and outside the truncated regions. This is not as serious a problem as allowing negative weights. It turns out that the use of a unique-counting algorithm like HLL as an aggregator inside a Heavy-Hitter / Frequent-Items sketch has been written about and studied quite a bit in the scientific literature.
However, the algorithms that have been proposed so far are pretty nasty, and it is not at all clear that they are practical (e.g., they are not mergeable, they are complex, their results are probabilistic, and other issues). For a mergeable process that incorporates truncation/limiting followed by aggregation, to date we know of no algorithm that would produce trustworthy results if the aggregations allow negative values or non-linear calculations (e.g., unique counting).

To lead the user to believe (or hope) that an expression like `SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N` (when `useApproximateTopN` is enabled) would produce meaningful, trustworthy results with arbitrary aggregation plug-ins is fundamentally misleading. And hiding behind the "data-sensitive" caveat doesn't change that.

The only algorithms that do incorporate truncation/limiting followed by aggregation and do provide trustworthy results are the sketching algorithms described in the scientific literature: the main ones are the Misra-Gries, Space-Saving, Heavy-Hitters, and Frequent-Items algorithms. And all of these algorithms require that the weight aggregations be linear and additive with positive numbers. These are not a subset of some larger set of working algorithms; to date, they are the only set that works.
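To make the cancellation failure mode described in point 1 above concrete, here is a minimal Python sketch (toy data and function names of my own invention, not Druid code) of a TopN that aggregates per segment, truncates each segment to its local top N, and then merges the survivors. With mixed-sign weights, a key's negative contribution can be discarded by one segment's local truncation, so the merged ranking can promote a key whose true total is far below a key it displaces:

```python
from collections import Counter

def exact_topn(segments, n):
    # Brute force: aggregate every row across all segments, then take top n.
    totals = Counter()
    for seg in segments:
        for key, weight in seg:
            totals[key] += weight
    return totals.most_common(n)

def truncated_topn(segments, n):
    # Approximate: each segment aggregates locally and keeps only its own
    # top n (the truncation step); the "coordinator" then merges just those
    # survivors and takes the top n of the merge.
    merged = Counter()
    for seg in segments:
        seg_totals = Counter()
        for key, weight in seg:
            seg_totals[key] += weight
        for key, total in seg_totals.most_common(n):
            merged[key] += total
    return merged.most_common(n)

# Key "A" has a large positive weight on segment 1 and an equally large
# negative weight on segment 2, so its true total is 0.
segments = [
    [("A", 100), ("B", 60), ("C", 55)],
    [("A", -100), ("B", 60), ("C", 55)],
]

print(exact_topn(segments, 2))      # [('B', 120), ('C', 110)]
print(truncated_topn(segments, 2))  # [('B', 120), ('A', 100)]
```

The exact top 2 is B (120) and C (110), but the truncated merge reports B (120) and A (100): A's -100 contribution on segment 2 never survived that segment's local top-2 truncation, so the cancellation is lost and A (true total 0) displaces C. With non-negative weights this cannot happen, because truncation can only under-count a key, never over-count it.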
