leerho commented on issue #6581: Moments Sketch custom aggregator
URL: https://github.com/apache/incubator-druid/pull/6581#issuecomment-460860422

@edgan8 Thank you for your reply.

> ...larger than expected errors...

1. What is the "expected error"? The Moments Sketch is an empirical algorithm, and you cannot state *a priori* what the error will be on any subsequent query.

2. Providing a patch to fix the zeros spike is only a band-aid for this particular data set. It does not provide any guarantee of better error on other data sets. As I stated in the study, "real data can be quite ugly", and it is not hard to find real data sets with lots of strange bumps, spikes and values that do not fit the theoretically smooth and well-behaved distributions that the Moments Sketch relies on.

3. If a user does not bother to perform an exact analysis, or to compare the Moments Sketch results with a sketch that provides useful *a priori* error bounds (e.g. *L_inf*), he/she would never know that there was a severe error problem! This is my biggest concern. Even the Druid engineers do not know what kinds of data their customers will ask Druid to analyze.

4. There is no argument that the Moments Sketch is fast and small in size, but what good is that if it can produce large errors as a result? Worse, the end user has no way to know, after the fact, what the error actually is!

5. Hopefully, we can learn from Druid's experience with the Approximate Histogram (AH) and the Druid HLL sketch. Because Druid incorporated those two algorithms into its core early on, users assumed that they must be good and that they had been sufficiently tested and studied. After all, the Druid engineers are a bunch of really smart people! Unfortunately, Druid end users did not bother to find and read the AH paper, where the authors admit that the AH algorithm has serious limitations. And they likely didn't have the necessary skills to do a deep dive into the Druid HLL sketch algorithm to uncover its problems. Now, unfortunately, both groups of users are stuck with lots of historical data of dubious quality with no means of recovery.
