leerho commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-475121316

@peferron Thank you for your response and for performing a test with an HLL aggregator. The intent of my statements in the preceding entries is to protect the end users of Druid, and ultimately to protect the reputation of Druid as a trustworthy platform.

In my statements above, my use of the word "meaningful" is with respect to an end user and is meant to imply that the results of **any** query of the above shape could be interpreted usefully (i.e., interpreted correctly and similar to what a brute-force exact computation would produce). My use of the word "belief" is again with respect to the end user: that he or she might believe (because the Druid specifications say it can be done) that the **results** of **any** query of the shape stated above would be meaningful (useful and correct). Of course the query can be executed; that is beside the point, and doesn't require much trust or belief.

Unfortunately, running one test does not prove that an algorithm or process works correctly. Nor does running a hundred or a thousand such tests. However, all one has to do to prove that an algorithm or process does *not* work is to provide one counterexample test that fails.

Relying on the fact that the documentation states that the TopN process is _data sensitive_ is a very weak argument, as most end users probably don't read the documentation, or don't think about what that phrase might mean. The fact that a data-sensitive query might run 30% or 75% faster is a misleading "sales pitch" to the end users, because in order for an end user to actually trust that the results from such a query are usefully correct, the end user would have to run each query in brute-force, exact mode and check its results against the results from the "faster" query.
The total time to do both queries is clearly longer than just running the brute-force exact query by itself. If you think this is ridiculous, it follows directly from the "data sensitive" statement: you cannot prove that a "data sensitive" process works on all data, because you can't possibly have access to all data. More importantly, you don't have access to the end user's data! The statement "data sensitive" is a huge admission that the process may not work, and without running brute-force checks on every run, the user can never know whether the results are trustworthy or not.

It appears to be your opinion that just documenting the process with the words "data sensitive" is enough, and that if the user wishes to shoot themselves in the foot, that is OK. I don't share that view, and perhaps nothing I say will change your mind; nonetheless, there are a couple of thoughts that you might consider.

1. Allowing positive and negative values as weights in the aggregations. This actually can work if there is no limiting or truncation of the data followed by aggregation. However, with truncation followed by aggregation, positive and negative weights can easily cancel across segments or nodes within the limited region, nullifying any earlier TopN results from individual segments or nodes. This, I hope you can see, would easily lead to nonsensical results.

2. Allowing non-linear processes, such as unique counting (HLL), as an aggregation option can also lead to very strange results, both within and outside the truncated regions. This is not as serious a problem as allowing negative weights. It turns out that the use of a unique-counting algorithm like HLL as an aggregator inside a Heavy-Hitter / Frequent-Items sketch has been written about and studied quite a bit in the scientific literature.
However, the algorithms that have been proposed so far are pretty nasty, and it is not at all clear that they are practical (e.g., they are not mergeable, they are complex, their results are probabilistic, and other issues). For a mergeable process that incorporates truncation/limiting followed by aggregation, to date we know of no algorithm that would produce trustworthy results if the aggregations allow negative values or non-linear calculations (e.g., unique counting).

To lead the user to believe (or hope) that an expression like `SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N` (when `useApproximateTopN` is enabled) would produce meaningful, trustworthy results with arbitrary aggregation plug-ins is fundamentally misleading. And hiding behind the "data-sensitive" caveat doesn't change that.

The only algorithms that do incorporate truncation/limiting followed by aggregation and do provide trustworthy results are the sketching algorithms described in the scientific literature: the main ones are the Misra-Gries, Space-Saving, Heavy-Hitters, and Frequent-Items algorithms. And all of these algorithms require that the weight aggregations be linear and additive with positive numbers. These are not a subset of some larger set of working algorithms; to date, they are the only set that works.
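To make the cancellation failure mode described in point 1 above concrete, here is a minimal Python sketch (toy data and function names of my own invention, not Druid code) of a TopN that aggregates per segment, truncates each segment to its local top N, and then merges the survivors. With mixed-sign weights, a key's negative contribution can be discarded by one segment's local truncation, so the merged ranking can promote a key whose true total is far below a key it displaces:

```python
from collections import Counter

def exact_topn(segments, n):
    # Brute force: aggregate every row across all segments, then take top n.
    totals = Counter()
    for seg in segments:
        for key, weight in seg:
            totals[key] += weight
    return totals.most_common(n)

def truncated_topn(segments, n):
    # Approximate: each segment aggregates locally and keeps only its own
    # top n (the truncation step); the "coordinator" then merges just those
    # survivors and takes the top n of the merge.
    merged = Counter()
    for seg in segments:
        seg_totals = Counter()
        for key, weight in seg:
            seg_totals[key] += weight
        for key, total in seg_totals.most_common(n):
            merged[key] += total
    return merged.most_common(n)

# Key "A" has a large positive weight on segment 1 and an equally large
# negative weight on segment 2, so its true total is 0.
segments = [
    [("A", 100), ("B", 60), ("C", 55)],
    [("A", -100), ("B", 60), ("C", 55)],
]

print(exact_topn(segments, 2))      # [('B', 120), ('C', 110)]
print(truncated_topn(segments, 2))  # [('B', 120), ('A', 100)]
```

The exact top 2 is B (120) and C (110), but the truncated merge reports B (120) and A (100): A's -100 contribution on segment 2 never survived that segment's local top-2 truncation, so the cancellation is lost and A (true total 0) displaces C. With non-negative weights this cannot happen, because truncation can only under-count a key, never over-count it.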
