leerho commented on issue #6581: Moments Sketch custom aggregator
URL: https://github.com/apache/incubator-druid/pull/6581#issuecomment-436856077

We just became aware of the underlying [paper](https://arxiv.org/abs/1803.01969) for this submission a few days ago and are still in the process of reviewing it. It is not up to me whether this code should be merged into Druid as a custom aggregator. However, the code has almost no Javadocs, and the paper may be difficult for many users to fully understand: the trade-offs, advantages, and disadvantages of how and when to use this kind of quantiles sketch as compared to the DataSketches (DS) quantiles sketches [already available](https://github.com/apache/incubator-druid/tree/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches) in Druid.

Quoting from the paper's Abstract:

> Empirical evaluation shows that the moments sketch can achieve less than 1 percent quantile error with 15× less overhead than comparable summaries, improving end query time in the MacroBase engine by up to 7× and the Druid engine by up to 60×.

This is a very exciting claim, but understanding some of the assumptions behind it requires a bit of a deep dive into the paper, unraveling what this sketch would be great at doing and where it may not be so great.

The Moments-Quantiles sketch (M-Sketch) has been optimized for merge-time performance, and on this metric the M-Sketch really shines: its merge speed can be an order of magnitude faster than the DS-Sketch. Meanwhile, obtaining a quantile estimate from an M-Sketch can take milliseconds, while the DS-Sketch is in the microsecond range. The paper defines the primary metric for sketch performance as _total query time_, where merges are numerous and get-quantile estimates are rare, and for many Druid queries this trade-off may make sense.

The other major trade-off is how sketch accuracy is defined and measured.
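To make the merge-speed point concrete, here is a toy sketch (my own illustration, not the paper's implementation; the real moments sketch also keeps log-moments and answers quantile queries by solving a maximum-entropy optimization, which is omitted here). The summary is just a count, min, max, and the first k power sums, so merging two summaries is a handful of additions, independent of how many items were sketched:

```python
import math

class MomentsSketchLite:
    """Toy illustration of a moments-style summary: count, min, max,
    and the first k power sums. Quantile estimation is NOT shown --
    that is the expensive (milliseconds) maximum-entropy step."""

    def __init__(self, k=4):
        self.k = k
        self.count = 0
        self.min = math.inf
        self.max = -math.inf
        self.power_sums = [0.0] * k  # sums of x**1 .. x**k

    def update(self, x):
        self.count += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        for i in range(self.k):
            self.power_sums[i] += x ** (i + 1)

    def merge(self, other):
        # The entire merge is O(k) additions regardless of stream size,
        # which is why merge throughput can be an order of magnitude
        # faster than a rank-based sketch that must combine samples.
        self.count += other.count
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        for i in range(self.k):
            self.power_sums[i] += other.power_sums[i]
```

For example, merging a sketch of `[1, 2, 3]` with a sketch of `[4, 5]` yields count 5, min 1, max 5, and first power sum 15, exactly as if the combined stream had been sketched directly.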
The paper is very clear: "The M-Sketch accuracy is dataset dependent." By comparison, the accuracy of the DS-Sketch is __data independent__. This difference has several practical, real-world implications. The DS-Sketch doesn't care how ugly your input data stream is: it can have negative values, zeros, gaps, and multiple spikes or blocks in its distribution of values; the values can range over many orders of magnitude; and the error guarantees of the DS-Sketch will still hold. However, the authors of the paper make it clear in several places that the M-Sketch error guarantees only apply to relatively well-behaved and smooth value distributions (what the paper calls "non-pathological"). Unfortunately, the term "non-pathological" is not well defined, and the user has no way of knowing whether or not any given input data stream is appropriately "non-pathological" without performing extensive brute-force quantile analysis of the stream and comparing it with the M-Sketch results.

Another subtle difference between the two types of sketches is how error performance is measured and quoted. (This part is greatly simplified.) The M-Sketch paper effectively defines a total area difference between two distribution curves: one being the real underlying distribution, and the other being the curve effectively modeled by the moments computed by the sketch. The paper then defines the maximum error as, effectively, the maximum average error (the integral) over all queries along all points of the distribution. The DS-Sketch defines the maximum error as the maximum difference between the two curves at any point along the full range of the distribution. This means that the actual error from the M-Sketch could be huge (many times the quoted maximum error) for parts of the distribution and very small for other parts of the curve, so that on average, if you perform queries over the full range of values of the distribution, the average error would look pretty good.
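A small numeric illustration of the difference between these two error definitions (my own toy example, with made-up rank values, not numbers from either paper): an estimator whose error is concentrated in one region of the distribution can look excellent by the average metric while being far off by the maximum metric.

```python
# Hypothetical true vs. estimated normalized ranks at five query points.
# The estimate is exact everywhere except near the median.
true_ranks = [0.10, 0.25, 0.50, 0.75, 0.90]
est_ranks  = [0.10, 0.25, 0.58, 0.75, 0.90]

def avg_and_max_error(true_r, est_r):
    # "Average" error is the style of metric the M-Sketch paper quotes
    # (an integral over all queries); "max" error is the per-query
    # guarantee style the DS-Sketch quotes.
    errs = [abs(t - e) for t, e in zip(true_r, est_r)]
    return sum(errs) / len(errs), max(errs)

avg_err, max_err = avg_and_max_error(true_ranks, est_ranks)
# avg_err is about 0.016 (looks great), while max_err is about 0.08 --
# five times worse -- for a query that happens to land near the median.
```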
But the user would have no clue _where_ along the distribution the error is very low or outrageously high. In contrast, the DS-Sketch error guarantee holds for any single query and for all queries.

The other consequence of the M-Sketch's error dependence on the data is that the M-Sketch cannot give the user any before-the-fact guidance on what the error will be after the data has been sketched. The DS-Sketch does provide this guidance.

So as long as you know a great deal about the underlying distributions of your data, or you don't care too much about error, and you are only concerned about total query time, go ahead and use the M-Sketch. If you don't know anything about the underlying distributions of your data, you do care about error, and slower total query time is an affordable trade-off, then I would advise you to use the DataSketches quantiles sketch.

Cheers, Lee.
