If you want to also test the whole range, that's fine, but 64 points over the entire range means we haven't actually confirmed that it works properly as you approach the tail. The tails aren't sampled densely enough to observe that behavior.
Ok, first off, this is what the original quantiles sketch does when the equations go above 1.0 or below 0.0:

https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251

In other words, we just clip the result to stay within the known range, because the raw formula doesn't care about such limits and will happily return something outside that range otherwise.

And our existing bounds on the classic quantiles sketch are too pessimistic most of the time. Take this simple code example:

    public void rankErrorTest() {
      int n = 1 << 20;
      int k = 1 << 5;
      UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
      for (int i = 0; i < n - 1; i++) { sketch.update(rand.nextGaussian()); }
      System.err.println("Retained: " + sketch.getRetainedItems()
          + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
      sketch.update(rand.nextGaussian());
      System.err.println("Retained: " + sketch.getRetainedItems()
          + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
    }

I insert n-1 items, then print the retained items and rank error. Then I insert 1 more item and do the same. The output is:

    Retained: 511	NormRakErr: 0.05415609535374165
    Retained: 32	NormRakErr: 0.05415609535374165

And that makes sense, since our rank error function doesn't take n into account at all. As a result, we have no better an error estimate when retaining 511 items than when retaining 32. The error bound is clearly overly conservative, not taking advantage of how many items we're actually retaining. We can see this in our documentation here (the 99th percentile plot):

http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png

With the error guarantee in the paper of |\hat{R}(y) - R(y)| <= \varepsilon R(y), the linear bounds seem correct.
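For illustration, the clipping idea at the linked DoublesSketch lines can be sketched as a plain clamp. This is a minimal standalone sketch of the concept, not the actual DataSketches code; the class and method names here are made up for the example:

```java
// Minimal sketch (hypothetical names) of clamping a rank bound into [0.0, 1.0]:
// the raw bound formula can stray outside that range, so the result is
// clipped back into the valid range before being returned.
public final class RankBoundClip {
  // Clamp a raw rank-bound value into the valid rank range [0.0, 1.0].
  static double clip(double rawRankBound) {
    return Math.max(0.0, Math.min(1.0, rawRankBound));
  }

  public static void main(String[] args) {
    System.out.println(clip(1.07));  // above the range -> prints 1.0
    System.out.println(clip(-0.02)); // below the range -> prints 0.0
    System.out.println(clip(0.45));  // in range, unchanged -> prints 0.45
  }
}
```

The point is only that the clamp is applied after evaluating the raw formula; the formula itself is left alone.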
In fact, anything with error increasing exponentially seems suspect. But perhaps I didn't think through the details of the rank definition enough. Anyway, maybe I'm just confused by what exactly you mean by "predict the actual error behavior"?

jon

On Sun, Sep 13, 2020, 10:02 PM leerho <[email protected]> wrote:

> I disagree. The bounds on our current sketches do a pretty good job of
> predicting the actual error behavior. That is why we have gone to the
> trouble of allowing the user to choose the size of the confidence interval,
> and in the case of the theta sketches we actually track the growth of the
> error contour in the early stages of estimation mode. The better we can
> specify the error, the more confident the user will be in the result.
>
> For these early characterization tests it is critical to understand the
> error behavior over the entire range of ranks. How we ultimately specify
> and document the proper use of this sketch is something we will do later.
>
> Lee.
>
> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <[email protected]> wrote:
>
>> Our existing bounds, especially for the original quantiles sketch, don't
>> necessarily predict the actual error behavior. They're an upper bound on
>> it. Especially right before one of the major cascading compaction rounds,
>> our error may be substantially better than the bound.
>>
>> I also feel like characterization plots going from 0.0 to 1.0 kind of
>> miss the point of the relative error quantiles sketch. It's useful to
>> know what it's doing everywhere, but if you care about that whole range
>> you should be using KLL, not this sketch. If you care about accuracy over
>> the whole range and also high precision at the tail, you should be using
>> both sketches together. The interesting thing with this sketch would be
>> to focus only on the error in, say, the outer 5% at most. Maybe even a
>> log-scale plot approaching the edge of the range.
>>
>> jon
>
> --
> From my cell phone.
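To spell out the linear-bound observation from the guarantee |\hat{R}(y) - R(y)| <= \varepsilon R(y): the permitted absolute rank error scales with the true rank itself, so plotting the bound against rank gives a straight line. A minimal standalone illustration (the epsilon and rank values here are made-up numbers for the example, not taken from any sketch):

```java
// Standalone illustration (not DataSketches code) of the relative-error
// guarantee |R_hat(y) - R(y)| <= eps * R(y): the maximum allowed absolute
// rank error grows linearly with the true rank R(y).
public final class RelativeErrorBound {
  // Maximum absolute rank error the guarantee permits at true rank r.
  static double allowedError(double eps, long r) {
    return eps * r;
  }

  public static void main(String[] args) {
    double eps = 0.01; // hypothetical relative-error parameter
    for (long rank : new long[] {100L, 10_000L, 1_000_000L}) {
      // Doubling the rank doubles the allowance: a linear bound, not exponential.
      System.out.println("R(y) = " + rank
          + "  allowed |R_hat - R| <= " + allowedError(eps, rank));
    }
  }
}
```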
