leerho commented on issue #6581: Moments Sketch custom aggregator
URL: https://github.com/apache/incubator-druid/pull/6581#issuecomment-436856077
 
 
   We just became aware of the underlying 
[paper](https://arxiv.org/abs/1803.01969) for this submission a few days ago 
and are still in the process of reviewing it.  
   
   It is not up to me whether this code should be merged into Druid as a custom 
aggregator. However, the code has almost no Javadocs and the paper may be 
difficult for many users to fully understanding the trade-offs, advantages and 
disadvantages for how and when to use this kind of quantiles sketch as compared 
to the DataSketches (DS) quantiles sketches [already 
available](https://github.com/apache/incubator-druid/tree/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches)
 in Druid.  
   
   Quoting from the paper's Abstract:
   
   > Empirical evaluation shows that the moments sketch can achieve less than 1 
percent quantile error with 15× less overhead than comparable summaries, 
improving end query time in the MacroBase engine by up to 7× and the Druid 
engine by up to 60×.
   
   This is a very exciting claim, but understanding some of the assumptions 
behind this claim requires a bit of a deep-dive into the paper and unraveling 
what this sketch would be great at doing, and where it may not be so great.
   
   The Moments quantiles sketch (M-Sketch) has been optimized for merge-time 
performance, and on this metric the M-Sketch really shines: its merge speed 
can be an order of magnitude faster than the DS-Sketch's. Meanwhile, obtaining an 
M-Sketch quantile estimate can take milliseconds, while the DS-Sketch is in the 
microsecond range. 
   
   The paper defines the primary metric for sketch performance as _total 
query-time_, for workloads where the number of merges is large and 
get-quantile estimates are rare, and perhaps for many Druid queries this 
trade-off makes sense.
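   As a rough back-of-the-envelope illustration of that trade-off (the timings 
below are hypothetical placeholders of my own, not measurements from the paper 
or from either library):
   
   ```python
   # Hypothetical model of total query time = merge work + estimate work.
   # All microsecond figures below are illustrative assumptions only.
   
   def total_query_time_us(n_merges, t_merge_us, n_estimates, t_estimate_us):
       """Total time spent merging sketches plus extracting quantiles."""
       return n_merges * t_merge_us + n_estimates * t_estimate_us
   
   # Suppose a merge-optimized sketch merges 10x faster but estimates
   # quantiles 1000x slower than a comparable sketch:
   merge_optimized = total_query_time_us(n_merges=1_000_000, t_merge_us=0.1,
                                         n_estimates=1, t_estimate_us=5_000.0)
   estimate_fast = total_query_time_us(n_merges=1_000_000, t_merge_us=1.0,
                                       n_estimates=1, t_estimate_us=5.0)
   
   # With a million merges and a single estimate, merge speed dominates:
   print(merge_optimized < estimate_fast)  # True
   ```
   
   Once merges outnumber quantile extractions by orders of magnitude, merge 
speed dominates total query time, which is exactly the regime the paper 
optimizes for.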
    
   The other major tradeoff is how sketch accuracy is defined and measured. 
   
   The paper is very clear on this point: "The M-Sketch accuracy is dataset 
dependent". By comparison, the accuracy of the DS-Sketch is __data independent__. 
   
   This difference has several practical real-world implications. 
   
   The DS-Sketch doesn't care how ugly your input data stream is. It can have 
negative values, zeros, gaps, and multiple spikes or blocks in its distribution 
of values; the values can range over many orders of magnitude, and the error 
guarantees of the DS-Sketch will still hold. 
   
   However, the authors of the paper make it clear in several places that 
M-Sketch error guarantees only apply to relatively well-behaved and smooth value 
distributions (what the paper calls "non-pathological").  Unfortunately, the 
term "non-pathological" is not well-defined, and the user has no way of knowing 
whether any given input data stream is appropriately "non-pathological" 
without performing extensive brute-force quantile analysis of the stream and 
comparing it with the M-Sketch results. 
   
   Another subtle difference between the two types of sketches is how error 
performance is measured and quoted. 
   
   (This part is greatly simplified.) 
   
   The M-Sketch paper effectively measures error as an area difference between 
two distribution curves: the real underlying distribution, and the curve 
modeled by the moments computed by the sketch. The paper then defines the 
maximum error as, effectively, the maximum _average_ error (the integral) over 
all queries along all points of the distribution. The DS-Sketch, by contrast, 
defines the maximum error as the maximum difference between the two curves at 
any single point along the full range of the distribution. 
   
   This means that the actual error from the M-Sketch could be huge (many times 
the quoted maximum error) for parts of the distribution, and very small for 
other parts of the curve so that, on average, if you perform queries over the 
full range of values of the distribution, the average error would be pretty 
good. But the user would have no clue _where_ along the distribution the error 
is very low, or outrageously high.
   
   In contrast, the DS-Sketch error guarantee is for any single query and for 
all queries. 
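   To make the distinction concrete, here is a toy numerical illustration (my 
own construction, not code from either library) contrasting an average-error 
metric with a max-pointwise-error metric over the same hypothetical error 
profile:
   
   ```python
   # Toy illustration: average error vs. max pointwise error.
   # The error profile below is invented for demonstration purposes.
   
   # A grid of quantile queries q in (0, 1).
   queries = [i / 100 for i in range(1, 100)]
   
   def rank_error(q):
       # Hypothetical rank error of an approximation at query q:
       # tiny almost everywhere, but large in one narrow region.
       return 0.20 if abs(q - 0.5) < 0.02 else 0.001
   
   errors = [rank_error(q) for q in queries]
   
   avg_error = sum(errors) / len(errors)   # the "integral"-style metric
   max_error = max(errors)                 # the pointwise-guarantee metric
   
   # The average looks excellent even though one region is badly wrong:
   print(avg_error < 0.01)   # True
   print(max_error)          # 0.2
   ```
   
   An average-style bound can look very good while hiding a region of the 
distribution where individual queries are badly wrong, which is exactly the 
"no clue _where_" problem described above.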
   
   The other consequence of the M-Sketch's error dependence on the data is that 
the M-Sketch cannot give the user any before-the-fact guidance on what the error 
will be after the data has been sketched. The DS-Sketch does provide this 
guidance.
   
   So, as long as you know a great deal about the underlying distributions of 
your data, or you don't care much about error, and you are only concerned 
with total query-time, go ahead and use the M-Sketch. 
   
   If you don't know anything about the underlying distributions of your data, 
you do care about error, and slower total query-time is an affordable 
trade-off, then I would advise you to use the DataSketches quantiles sketch.
   
   Cheers,
   
   Lee.
