If you want to also test the whole range, that's fine, but 64 points over the entire range means we haven't actually confirmed that it works properly as you approach the tail. The tails aren't sampled densely enough to observe that behavior.
Ok, first off, this is what the original quantiles sketch does when the equations go above 1.0 or below 0.0:

https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251

In other words, we just clip the result to stay within the known range, because the raw formula doesn't care about such limits and will happily return something outside that range otherwise.

And our existing bounds on the classic quantiles sketch are too pessimistic most of the time. Take this simple code example:

    public void rankErrorTest() {
      int n = 1 << 20;
      int k = 1 << 5;
      UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
      for (int i = 0; i < n - 1; i++) { sketch.update(rand.nextGaussian()); }
      System.err.println("Retained: " + sketch.getRetainedItems()
          + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
      sketch.update(rand.nextGaussian());
      System.err.println("Retained: " + sketch.getRetainedItems()
          + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
    }

I insert n-1 items, then print the retained items and rank error. Then I insert 1 more item and do the same. The output is:

    Retained: 511	NormRakErr: 0.05415609535374165
    Retained: 32	NormRakErr: 0.05415609535374165

And that makes sense, since our rank error function doesn't take n into account at all. As a result, we have no better an error estimate when retaining 511 items than when retaining 32. The error bound is clearly overly conservative, not taking advantage of how many items we're actually retaining. We can see this in our documentation here (the 99th percentile plot):

http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png

With the error guarantee in the paper of |\hat{R}(y) - R(y)| <= \varepsilon R(y), the linear bounds seem correct.
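For illustration, the clipping idea at the linked DoublesSketch lines can be sketched as a plain clamp. This is a minimal standalone sketch of the concept, not the actual DataSketches code; the class and method names here are made up for the example:

```java
// Minimal sketch (hypothetical names) of clamping a rank bound into [0.0, 1.0]:
// the raw bound formula can stray outside that range, so the result is
// clipped back into the valid range before being returned.
public final class RankBoundClip {
  // Clamp a raw rank-bound value into the valid rank range [0.0, 1.0].
  static double clip(double rawRankBound) {
    return Math.max(0.0, Math.min(1.0, rawRankBound));
  }

  public static void main(String[] args) {
    System.out.println(clip(1.07));  // above the range -> prints 1.0
    System.out.println(clip(-0.02)); // below the range -> prints 0.0
    System.out.println(clip(0.45));  // in range, unchanged -> prints 0.45
  }
}
```

The point is only that the clamp is applied after evaluating the raw formula; the formula itself is left alone.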
In fact, anything with error increasing exponentially seems suspect. But perhaps I didn't think through the details of the rank definition enough. Anyway, maybe I'm just confused by what exactly you mean by "predict the actual error behavior"?

jon

On Sun, Sep 13, 2020, 10:02 PM leerho <[email protected]> wrote:

> I disagree. The bounds on our current sketches do a pretty good job of
> predicting the actual error behavior. That is why we have gone to the
> trouble of allowing the user to choose the size of the confidence interval,
> and in the case of the theta sketches we actually track the growth of the
> error contour in the early stages of estimation mode. The better we can
> specify the error, the more confident the user will be in the result.
>
> For these early characterization tests it is critical to understand the
> error behavior over the entire range of ranks. How we ultimately specify
> and document the proper use of this sketch is something we will do later.
>
> Lee.
>
> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <[email protected]> wrote:
>
>> Our existing bounds, especially for the original quantiles sketch, don't
>> necessarily predict the actual error behavior. They're an upper bound on
>> it. Especially right before one of the major cascading compaction rounds,
>> our error may be substantially better than the bound.
>>
>> I also feel like characterization plots going from 0.0 to 1.0 kind of
>> miss the point of the relative error quantiles sketch. It's useful to
>> know what it's doing everywhere, but if you care about that whole range
>> you should be using KLL, not this sketch. If you care about accuracy over
>> the whole range and also high precision at the tail, you should be using
>> both sketches together. The interesting thing with this sketch would be
>> to focus only on the error in, say, the outer 5% at most. Maybe even a
>> log-scale plot approaching the edge of the range.
>>
>> jon
>
> --
> From my cell phone.
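To spell out the linear-bound observation from the guarantee |\hat{R}(y) - R(y)| <= \varepsilon R(y): the permitted absolute rank error scales with the true rank itself, so plotting the bound against rank gives a straight line. A minimal standalone illustration (the epsilon and rank values here are made-up numbers for the example, not taken from any sketch):

```java
// Standalone illustration (not DataSketches code) of the relative-error
// guarantee |R_hat(y) - R(y)| <= eps * R(y): the maximum allowed absolute
// rank error grows linearly with the true rank R(y).
public final class RelativeErrorBound {
  // Maximum absolute rank error the guarantee permits at true rank r.
  static double allowedError(double eps, long r) {
    return eps * r;
  }

  public static void main(String[] args) {
    double eps = 0.01; // hypothetical relative-error parameter
    for (long rank : new long[] {100L, 10_000L, 1_000_000L}) {
      // Doubling the rank doubles the allowance: a linear bound, not exponential.
      System.out.println("R(y) = " + rank
          + "  allowed |R_hat - R| <= " + allowedError(eps, rank));
    }
  }
}
```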
