leerho commented on issue #6581: Moments Sketch custom aggregator
URL: https://github.com/apache/incubator-druid/pull/6581#issuecomment-460860422
 
 
   @edgan8 Thank you for your reply.  
   
   > ...larger than expected errors...
   
   1. What is the "expected error"?  The fact is that the Moments Sketch is an 
empirical algorithm and you cannot state *a priori* what the expected error 
will be on any subsequent query.  
   
   2. Providing a patch to fix the zeros spike is but a band-aid for this 
particular data set.  It does not provide any guarantees for better error on 
other data sets.  As I stated in the study, "real data can be quite ugly", and 
it is not hard to find real data sets with lots of strange bumps, spikes and 
values that do not fit the theoretically smooth and well-behaved distributions 
that the Moments Sketch relies on.  
   
   3. If a user does not bother to perform an exact analysis or compare the 
Moments Sketch results with a sketch that provides, *a priori*, useful (e.g. 
*L_inf*) error bounds, they would never know that there was a severe error 
problem!  This is my biggest concern.  Even the Druid engineers do not know 
what kinds of data their customers will be asking Druid to analyze. 
   
   4. There is no argument that the Moments Sketch is fast and small in size, 
but what good is that if it can produce large errors as a result?  Worse, the 
end user has no way to know, after the fact, what the error actually is!
   
   5. Hopefully, we can learn from Druid's experience with the Approximate 
Histogram and the Druid HLL sketch.  Because Druid incorporated those two 
algorithms into its core early on, users assumed that they must be good and 
that they were sufficiently tested and studied.  After all, the Druid engineers 
are a bunch of really smart people! 
   
   Unfortunately, Druid end users did not bother to find and read the AH paper, 
in which the authors admit that the AH algorithm has serious limitations.  And 
they likely didn't have the necessary skills to do a deep dive into the Druid 
HLL sketch algorithm to uncover its problems.  Now both groups of users are 
stuck with lots of historical data of dubious quality with no means of 
recovery.
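
   To make point 3 concrete, here is a minimal sketch of the kind of exact 
check I have in mind.  It is not code from this PR or from any sketch library: 
the function name `max_rank_error`, the toy data, and the quantile estimates 
are all hypothetical, and the estimates are made-up numbers rather than real 
Moments Sketch output.  It measures the largest normalized rank error 
(an *L_inf*-style error) of a set of quantile estimates against the raw data.

```python
from bisect import bisect_left, bisect_right


def max_rank_error(data, estimates):
    """Return the largest normalized rank error over the queried quantiles.

    data:      the raw values (an exact check needs all of them).
    estimates: dict mapping a target rank q in [0, 1] to the value that
               some sketch returned for that quantile (hypothetical input).
    """
    ordered = sorted(data)
    n = len(ordered)
    worst = 0.0
    for q, value in estimates.items():
        # A repeated value occupies a whole interval of ranks, so the
        # error is the distance from q to that interval, not to a point.
        lo = bisect_left(ordered, value) / n
        hi = bisect_right(ordered, value) / n
        if q < lo:
            err = lo - q
        elif q > hi:
            err = q - hi
        else:
            err = 0.0
        worst = max(worst, err)
    return worst


# Toy data with a spike at zero: 50 zeros followed by the values 1..50.
data = [0] * 50 + list(range(1, 51))

# Made-up estimates standing in for a sketch's answers (assumption).
estimates = {0.25: 0, 0.5: 10, 0.9: 40}

worst = max_rank_error(data, estimates)
```

   The catch is that this check needs the raw data, which is exactly what an 
end user querying stored sketches no longer has.  A sketch with an *a priori* 
bound lets the user state a worst-case rank error without running anything 
like this.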
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

