> that's fine, but 64 points over the entire range means we haven't
> actually confirmed that it's working properly as you approach the tail.
> It's not sampled well enough at the tails to observe that behavior.

I'll second that the primary motivation behind studying relative-error
quantiles is to get really good accuracy in the tail. So while we should,
of course, carefully characterize the error behavior over the whole range
of ranks, a key component of the characterization should be to understand
in practice how the error is behaving in the tail. This may require
significantly more fine-grained characterization testing in the tail than
we are used to.

In particular, a lot of the ongoing discussion may stem from the fact
that the quantiles being reported are only the integer multiples of 1/64.
My initial reaction is that the benefit of the relative-error sketch will
be more apparent if the characterization tests cut up the interval [0,
1/64] much more finely.
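For instance, purely as a sketch of test plumbing (the point count of 64 and
the 1e-6 floor are arbitrary choices of mine, not from any existing test
code), the evaluation ranks could be spaced geometrically so the extreme
tail is probed as densely as the rest of [0, 1/64]:

```java
// Hypothetical helper for characterization tests: generate geometrically
// spaced relative ranks in [minRank, maxRank], so that each decade of the
// tail gets a comparable number of evaluation points.
public class TailRanks {
    public static double[] tailRanks(int numPoints, double maxRank, double minRank) {
        double[] ranks = new double[numPoints];
        // Constant ratio between consecutive points.
        double ratio = Math.pow(maxRank / minRank, 1.0 / (numPoints - 1));
        double r = minRank;
        for (int i = 0; i < numPoints; i++) {
            ranks[i] = r;
            r *= ratio;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // 64 points packed into [1e-6, 1/64] instead of spread over [0, 1].
        double[] ranks = tailRanks(64, 1.0 / 64, 1e-6);
        System.out.printf("first=%.2e last=%.6f%n", ranks[0], ranks[63]);
    }
}
```

This puts as many test points in [1e-6, 1e-4] as in [1e-4, 1e-2], which is
the regime where the relative-error guarantee should shine.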

For example, the theory guarantees that for Low-Rank-Accuracy, items of
(absolute) rank 0, 1, ..., 1/\varepsilon have exact counts; items of rank
(1/\varepsilon)+1, (1/\varepsilon)+2, ..., 2/\varepsilon have absolute
error at most 1; and so on.

In terms of relative ranks (normalized by stream length SL), each interval
of 1/\varepsilon items referenced above has length just
1/(\varepsilon \cdot SL), i.e., we only guarantee exact counts for items of
relative rank [0, 1/(\varepsilon \cdot SL)].

If \varepsilon is, say, 0.01, and SL is 2^{17}, then we are talking about
intervals of length less than 0.001. This means that I am not surprised to
see that we do not get _exact_ counts out of the sketch in practice, even
in the tail, if we are effectively defining the tail to be all items of
relative rank [0, 1/64] ~= [0, 0.0156].
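Spelling out that arithmetic (a throwaway computation, not anything from the
library API):

```java
// For epsilon = 0.01 and SL = 2^17, the exact-count guarantee covers one
// interval of 1/epsilon consecutive absolute ranks, which in relative-rank
// terms has width (1/epsilon)/SL -- well under 0.001 -- while the "tail"
// [0, 1/64] spans roughly 20 such intervals.
public class TailIntervals {
    // Relative-rank width of one 1/epsilon-item interval for stream length sl.
    public static double exactIntervalWidth(double epsilon, long sl) {
        return (1.0 / epsilon) / sl;
    }

    public static void main(String[] args) {
        double epsilon = 0.01;
        long sl = 1L << 17;                         // stream length, 131072
        double width = exactIntervalWidth(epsilon, sl);
        double tail = 1.0 / 64;
        System.out.printf("exact-count interval width: %.6f%n", width);
        System.out.printf("such intervals inside [0, 1/64]: %.1f%n", tail / width);
    }
}
```

So only about the first 1/20th of the nominal "tail" carries the exact-count
guarantee; the rest of [0, 1/64] is covered by the weaker (but still small)
relative-error bounds.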

It's a bit tricky to try to figure out what setting of \varepsilon is
(implicitly) being used in the current characterization code since the
parameter provided is K as opposed to \varepsilon.

I think K here means that each buffer of size ~2B is viewed by the
algorithm as consisting of 2B/K blocks of length K. If so, then, if I
recall correctly, the paper wants to set K to
Theta((1/\varepsilon)/\sqrt{log_2(\varepsilon * SL)}) (with complications
arising because SL is not known in advance, and with sizeable constants
hidden by the Big-Theta notation); see Equation 16 of
https://arxiv.org/pdf/2004.01668.pdf. So in principle, once you tell me SL,
I can probably derive the value of \varepsilon corresponding to any value
of K, but this is complicated in practice because the expression defining K
in terms of \varepsilon and SL is itself complicated.
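Just to illustrate, here is a sketch of that derivation taking the
Big-Theta expression at face value with its hidden constant set to 1 (so
the actual values will be off by whatever the paper's constants turn out to
be). Since K(\varepsilon) = (1/\varepsilon)/\sqrt{log_2(\varepsilon * SL)}
is decreasing in \varepsilon, a simple bisection inverts it:

```java
// Numerically invert K(eps) = (1/eps) / sqrt(log2(eps * SL)) to recover
// the (approximate) epsilon implied by a given K. Assumes the hidden
// Big-Theta constant is 1, so this gives ballpark values only.
public class EpsFromK {
    static double kOf(double eps, long sl) {
        return (1.0 / eps) / Math.sqrt(Math.log(eps * sl) / Math.log(2));
    }

    public static double epsFromK(double k, long sl) {
        // kOf is strictly decreasing on [2/SL, 0.5], so bisect.
        double lo = 2.0 / sl, hi = 0.5;
        for (int i = 0; i < 100; i++) {
            double mid = 0.5 * (lo + hi);
            if (kOf(mid, sl) > k) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        long sl = 1L << 17;
        double eps = epsFromK(50, sl);
        System.out.printf("K=50, SL=2^17  ->  eps ~ %.4f%n", eps);
    }
}
```

Again, this is only good for order-of-magnitude intuition; the real
constants (and the handling of unknown SL) would shift the answer.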

This is not any sort of criticism of the information that Lee has provided
:-). It is just an inherently annoying thing to relate K to the \varepsilon
parameter appearing in the theoretical analysis of the algorithm.


On Mon, Sep 14, 2020 at 3:19 AM Jon Malkin <[email protected]> wrote:

> If you want to also test the whole range, that's fine, but 64 points over
> the entire range means we haven't actually confirmed that it's working
> properly as you approach the tail. It's not sampled well enough at the
> tails to observe that behavior.
>
> Ok, first off, this is what the original quantiles sketch does to solve
> equations going above 1.0/below 0.0:
>
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
>
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
> In other words, we just clip the result to stay within the known range.
> Because the raw formula doesn't care about such limits and will happily
> return something outside that range otherwise.
>
> And our existing bounds on the classic quantiles sketch are too
> pessimistic most of the time. I mean take this simple code example:
>
> public void rankErrorTest() {
>   int n = 1 << 20;
>   int k = 1 << 5;
>   Random rand = new Random();
>
>   UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
>   for (int i = 0; i < n - 1; i++)
>     sketch.update(rand.nextGaussian());
>   System.err.println("Retained: " + sketch.getRetainedItems() +
>       "\tNormRankErr: " + sketch.getNormalizedRankError(false));
>
>   sketch.update(rand.nextGaussian());
>   System.err.println("Retained: " + sketch.getRetainedItems() +
>       "\tNormRankErr: " + sketch.getNormalizedRankError(false));
> }
>
> I insert n-1 items, then print retained items and rank error. Then I
> insert 1 more and do the same. The output is:
>
> Retained: 511  NormRankErr: 0.05415609535374165
> Retained: 32   NormRankErr: 0.05415609535374165
> And that makes sense since our rank error function doesn't take into
> account n at all. As a result, we have no better an error estimate when
> retaining 511 items compared to 32. The error bound is clearly being overly
> conservative, not taking advantage of how many items we're actually
> retaining. We see this in our documentation (it's the 99th-percentile
> plot):
> http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png
>
> With the error guarantee in the paper of |\hat{R}(y) - R(y)| <=
> \varepsilon R(y), the linear bound seems correct. In fact, anything with
> error increasing exponentially seems suspiciously odd. But perhaps I didn't
> think through the details of the rank definition enough.
>
> Anyway, maybe I'm just confused by what exactly you mean by "predict the
> actual error behavior"?
>
>   jon
>
>
> On Sun, Sep 13, 2020, 10:02 PM leerho <[email protected]> wrote:
>
>> I disagree.  The bounds on our current sketches do a pretty good job of
>> predicting the actual error behavior.  That is why we have gone to the
>> trouble of allowing the user to choose the size of the confidence interval
>> and in the case of the theta Sketches we actually track the growth of the
>> error Contour in the early stages of Estimation mode.   The better we can
>> specify the error, the more confident the user will be in the result.
>>
>> For these early characterization tests it is critical to understand the
>> error behavior over the entire range of ranks.  How we ultimately specify
>> and document how to properly use this sketch we will do later.
>>
>> Lee.
>>
>>
>> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <[email protected]> wrote:
>>
>>> Our existing bounds, especially for the original quantiles sketch,
>>> don't necessarily predict the actual error behavior. They're an upper bound
>>> on it. Especially right before one of the major cascading compaction
>>> rounds, our error may be substantially better than the bound.
>>>
>>> I also feel like characterization plots going from 0.0-1.0 kind of
>>> misses the point of the relative error quantiles sketch. It's kind of
>>> useful to know what it's doing everywhere but if you care about that whole
>>> range you should be using KLL, not this sketch. If you care about accuracy
>>> over the whole range and also high precision at the tail, you should be
>>> using both sketches together. The interesting thing with this sketch would
>>> be to focus only on error in, say, the outer 5% at most. Maybe even a log
>>> scale plot approaching the edge of the range.
>>>
>>>   jon
>>>
>> --
>> From my cell phone.
>>
>
