leerho commented on issue #6581: Moments Sketch custom aggregator
URL: https://github.com/apache/incubator-druid/pull/6581#issuecomment-436856077

We just became aware of the underlying [paper](https://arxiv.org/abs/1803.01969) for this submission a few days ago and are still in the process of reviewing it. It is not up to me whether this code should be merged into Druid as a custom aggregator. However, the code has almost no Javadocs, and the paper may be difficult for many users to fully understand: the trade-offs, advantages, and disadvantages of how and when to use this kind of quantiles sketch as compared to the DataSketches (DS) quantiles sketches [already available](https://github.com/apache/incubator-druid/tree/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches) in Druid.

Quoting from the paper's Abstract:

> Empirical evaluation shows that the moments sketch can achieve less than 1 percent quantile error with 15× less overhead than comparable summaries, improving end query time in the MacroBase engine by up to 7× and the Druid engine by up to 60×.

This is a very exciting claim, but understanding some of the assumptions behind it requires a bit of a deep dive into the paper, unraveling what this sketch would be great at doing and where it may not be so great.

The Moments-Quantiles sketch (M-Sketch) has been optimized for merge-time performance, and on this metric the M-Sketch really shines: its merge speed can be an order of magnitude faster than the DS-Sketch. Meanwhile, obtaining a quantile estimate from an M-Sketch can take milliseconds, while the DS-Sketch is in the microsecond range. The paper defines the primary metric for sketch performance as _total query time_, where merges are numerous and get-quantile estimates are rare, and for many Druid queries this trade-off may make sense.

The other major trade-off is how sketch accuracy is defined and measured.
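To make the merge-speed point concrete, here is a toy sketch (my own illustration, not the paper's implementation; the real moments sketch also keeps log-moments and answers quantile queries by solving a maximum-entropy optimization, which is omitted here). The summary is just a count, min, max, and the first k power sums, so merging two summaries is a handful of additions, independent of how many items were sketched:

```python
import math

class MomentsSketchLite:
    """Toy illustration of a moments-style summary: count, min, max,
    and the first k power sums. Quantile estimation is NOT shown --
    that is the expensive (milliseconds) maximum-entropy step."""

    def __init__(self, k=4):
        self.k = k
        self.count = 0
        self.min = math.inf
        self.max = -math.inf
        self.power_sums = [0.0] * k  # sums of x**1 .. x**k

    def update(self, x):
        self.count += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        for i in range(self.k):
            self.power_sums[i] += x ** (i + 1)

    def merge(self, other):
        # The entire merge is O(k) additions regardless of stream size,
        # which is why merge throughput can be an order of magnitude
        # faster than a rank-based sketch that must combine samples.
        self.count += other.count
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        for i in range(self.k):
            self.power_sums[i] += other.power_sums[i]
```

For example, merging a sketch of `[1, 2, 3]` with a sketch of `[4, 5]` yields count 5, min 1, max 5, and first power sum 15, exactly as if the combined stream had been sketched directly.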
The paper is very clear: "The M-Sketch accuracy is dataset dependent." By comparison, the accuracy of the DS-Sketch is __data independent__. This difference has several practical, real-world implications. The DS-Sketch doesn't care how ugly your input data stream is: it can have negative values, zeros, gaps, and multiple spikes or blocks in its distribution of values; the values can range over many orders of magnitude; and the error guarantees of the DS-Sketch will still hold. However, the authors of the paper make it clear in several places that the M-Sketch error guarantees only apply to relatively well-behaved and smooth value distributions (what the paper calls "non-pathological"). Unfortunately, the term "non-pathological" is not well defined, and the user has no way of knowing whether or not any given input data stream is appropriately "non-pathological" without performing extensive brute-force quantile analysis of the stream and comparing it with the M-Sketch results.

Another subtle difference between the two types of sketches is how error performance is measured and quoted. (This part is greatly simplified.) The M-Sketch paper effectively defines a total area difference between two distribution curves: one being the real underlying distribution, and the other being the curve effectively modeled by the moments computed by the sketch. The paper then defines the maximum error as, effectively, the maximum average error (the integral) over all queries along all points of the distribution. The DS-Sketch defines the maximum error as the maximum difference between the two curves at any point along the full range of the distribution. This means that the actual error from the M-Sketch could be huge (many times the quoted maximum error) for parts of the distribution and very small for other parts of the curve, so that on average, if you perform queries over the full range of values of the distribution, the average error would look pretty good.
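A small numeric illustration of the difference between these two error definitions (my own toy example, with made-up rank values, not numbers from either paper): an estimator whose error is concentrated in one region of the distribution can look excellent by the average metric while being far off by the maximum metric.

```python
# Hypothetical true vs. estimated normalized ranks at five query points.
# The estimate is exact everywhere except near the median.
true_ranks = [0.10, 0.25, 0.50, 0.75, 0.90]
est_ranks  = [0.10, 0.25, 0.58, 0.75, 0.90]

def avg_and_max_error(true_r, est_r):
    # "Average" error is the style of metric the M-Sketch paper quotes
    # (an integral over all queries); "max" error is the per-query
    # guarantee style the DS-Sketch quotes.
    errs = [abs(t - e) for t, e in zip(true_r, est_r)]
    return sum(errs) / len(errs), max(errs)

avg_err, max_err = avg_and_max_error(true_ranks, est_ranks)
# avg_err is about 0.016 (looks great), while max_err is about 0.08 --
# five times worse -- for a query that happens to land near the median.
```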
But the user would have no clue _where_ along the distribution the error is very low or outrageously high. In contrast, the DS-Sketch error guarantee holds for any single query and for all queries.

The other consequence of the M-Sketch's error dependence on the data is that the M-Sketch cannot give the user any before-the-fact guidance on what the error will be after the data has been sketched. The DS-Sketch does provide this guidance.

So as long as you know a great deal about the underlying distributions of your data, or you don't care too much about error, and you are only concerned about total query time, go ahead and use the M-Sketch. If you don't know anything about the underlying distributions of your data, you do care about error, and slower total query time is an affordable trade-off, then I would advise you to use the DataSketches quantiles sketch.

Cheers, Lee.
