leerho commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-472623400
 
 
   Just to be clear and independent of FIS, I'm not advocating removing TopN. 
   
   What is very problematic in the current implementation is allowing 
aggregation functions after limiting / truncation of the data.  This can lead 
to wildly wrong results such that it is misleading to even call it "TopN" 
anymore.  By the way, it is very easy to prove that you could be entirely 
missing the Top-1, and with a little extra math, that it is not too hard 
to miss all of the TopN.
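
   A minimal sketch of the failure mode described above, assuming three hypothetical segments and a per-segment limit of 2. The segment data and the function names `truncate_then_merge` and `merge_then_truncate` are made up for illustration and are not Druid code. The key `x` is only third in every segment, so it is dropped before the merge even though it has the largest total.

```python
from collections import Counter

def truncate_then_merge(segments, k, per_segment_limit):
    """Truncate each segment to its local top entries, then aggregate."""
    merged = Counter()
    for seg in segments:
        # Only the locally largest keys survive; everything else is lost.
        for key, value in Counter(seg).most_common(per_segment_limit):
            merged[key] += value
    return merged.most_common(k)

def merge_then_truncate(segments, k):
    """Aggregate every key across all segments, then take the top k."""
    merged = Counter()
    for seg in segments:
        merged.update(seg)  # adds values key by key
    return merged.most_common(k)

# Hypothetical per-segment sums; "x" is never in a local top 2.
segments = [
    {"a": 12, "b": 9, "x": 8},
    {"c": 10, "d": 9, "x": 8},
    {"e": 10, "f": 9, "x": 8},
]

approx = truncate_then_merge(segments, k=1, per_segment_limit=2)
exact = merge_then_truncate(segments, k=1)
print(approx)  # [('a', 12)]
print(exact)   # [('x', 24)]
```

   With this data the truncated path reports `a` as the Top-1, while the true Top-1 is `x` with a total of 24.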
   
   You can still have TopN and you can still provide the aggregation functions 
as long as they are all performed prior to any limiting/truncation.  If you 
follow this, it changes the `max(k, 1000)` step to just `k`, which means you 
will be sending far less data to the broker.  And now when the broker applies 
a `PriorityQueue(k)`, you will have a TopN that is no longer data sensitive and 
is quite robust, although it will likely be slower.
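
   As a rough illustration of the `PriorityQueue(k)` step, here is a sketch of a bounded min-heap applied to totals that are assumed to be fully aggregated already. The function name `top_k_bounded` and the sample totals are hypothetical, and this is not the broker's actual code.

```python
import heapq

def top_k_bounded(totals, k):
    """Return the k largest (key, value) pairs using a heap of at most k entries."""
    if k <= 0:
        return []
    heap = []  # min-heap of (value, key); heap[0] is the smallest entry kept
    for key, value in totals.items():
        if len(heap) < k:
            heapq.heappush(heap, (value, key))
        elif (value, key) > heap[0]:
            # The new entry beats the smallest one kept so far: swap it in.
            heapq.heapreplace(heap, (value, key))
    # Largest first, returned as (key, value) pairs.
    return [(key, value) for value, key in sorted(heap, reverse=True)]

# Hypothetical totals, aggregated before any truncation.
totals = {"a": 12, "b": 9, "c": 10, "x": 24}
print(top_k_bounded(totals, 2))  # [('x', 24), ('a', 12)]
```

   Because the heap only ever sees complete totals, the result does not depend on how the data happened to be spread across segments.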
   
   I realize that speed is all-important to Druid, and that is why I love Druid, 
as I am a speed freak too :)
   But allowing functionality that effectively corrupts the intent of the query 
is not a good idea, as it can come back to bite you no matter how much you caveat 
it in the documentation. 
   
   "If it doesn't have to work, it can meet any requirement."
   "If you don't care about quality, you can achieve any objective."
    -- G.M. Weinberg
   
   Yes, perhaps the FIS should be a completely separate function; that is up to 
you.  But it is the only mechanism that allows a simple form of aggregation 
and "truncation" at the same time while being fast, data insensitive, and of 
known error.

