Hi Justin,
regarding K and epsilon: K is actually an initial setting of the section size,
and it should be equal to Theta(1/epsilon), regardless of the stream length SL.
This is because SL is not known in advance and we maintain the correct section
size dynamically. That is, each compactor starts by assuming that SL is small
(i.e., SL = O(K)), so it has just 3 sections of size K, and every time the
number of sections needs to be increased because (# of compactions) > 2^(# of
sections), we decrease K by a factor of sqrt(2) for that compactor and double
the number of sections.
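To make that rule concrete, here is a toy model of the growth rule; the class,
the field names, and the initial K of 50 are all illustrative assumptions, not
the actual DataSketches code:

```java
// Toy model of the dynamic section-size rule described above.
// All names and the starting K are illustrative, not the real implementation.
public class SectionGrowth {
    int sectionSize = 50;    // initial K, chosen as Theta(1/epsilon)
    int numSections = 3;     // each compactor starts with 3 sections
    long numCompactions = 0;

    void onCompaction() {
        numCompactions++;
        // Once (# of compactions) > 2^(# of sections), shrink K by sqrt(2)
        // and double the number of sections.
        if (numCompactions > (1L << numSections)) {
            sectionSize = (int) Math.ceil(sectionSize / Math.sqrt(2.0));
            numSections *= 2;
        }
    }

    public static void main(String[] args) {
        SectionGrowth g = new SectionGrowth();
        for (int i = 0; i < 9; i++) g.onCompaction();  // crosses 2^3 = 8
        System.out.println("K=" + g.sectionSize + " sections=" + g.numSections);
        // prints K=36 sections=6
    }
}
```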
So, it seems to me that in the implementation, the relation between K and
\varepsilon is quite clear, up to a constant factor :-) The relation of K and
epsilon is only complicated in the paper, where we involve SL in the definition
of K.
Pavel
PS: I agree that we should have more samples of the error distribution towards
the tail that we care about.
On 2020-09-14 14:32, Justin Thaler wrote:
> that's fine, but 64 points over the entire range means we haven't actually
> confirmed that it's working properly as you approach the tail. It's not
> sampled well enough at the tails to observe that behavior.
I'll second that the primary motivation behind studying relative-error
quantiles is to get really good accuracy in the tail. So while we should, of
course, carefully characterize the error behavior over the whole range of
ranks, a key component of the characterization should be to understand in
practice how the error is behaving in the tail. This may require significantly
more fine-grained characterization testing in the tail than we are used to.
In particular, a lot of the ongoing discussion may just stem from the fact that
the quantiles being reported are just the integer multiples of 1/64. My initial
reaction is that the benefit of the relative-error sketch will be more apparent
if the characterization tests cut up the interval [0, 1/64] much more finely.
For example, the theory guarantees that for Low-Rank-Accuracy, items of
(absolute) rank 0, 1, ..., 1/\varepsilon have exact counts; items of rank
(1/\varepsilon)+1, (1/\varepsilon)+2, ..., 2/\varepsilon have absolute error at
most 1; and so on.
In terms of relative ranks (normalized by stream length SL), each interval of
1/\varepsilon items referenced above has length just 1/(\varepsilon SL), i.e.,
we only guarantee exact counts for items of relative rank [0, 1/(\varepsilon SL)].
If \varepsilon is, say, 0.01, and SL is 2^{17}, then we are talking about
intervals of length less than 0.001. This means that I am not surprised to see
that we do not get _exact_ counts out of the sketch in practice even in the
tail if we are effectively defining the tail to be all items of relative rank
[0, 1/64]~=[0, 0.015].
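A quick numeric check of the interval sizes in question (just arithmetic,
independent of any sketch implementation):

```java
public class TailCheck {
    // Relative-rank length of the guaranteed-exact region:
    // 1/eps items out of a stream of length sl.
    static double exactRegion(double eps, long sl) {
        return (1.0 / eps) / sl;
    }

    public static void main(String[] args) {
        // eps = 0.01, SL = 2^17, as in the example above.
        System.out.println(exactRegion(0.01, 1L << 17));  // 7.62939453125E-4
        System.out.println(1.0 / 64);  // 0.015625 -- the "tail" is ~20x wider
    }
}
```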
It's a bit tricky to try to figure out what setting of \varepsilon is
(implicitly) being used in the current characterization code since the
parameter provided is K as opposed to \varepsilon.
I think K here means each buffer of size ~2B is viewed by the algorithm as
consisting of 2B/K blocks of length K. In which case, if I recall correctly,
the paper wants to set K to Theta((1/\varepsilon)/\sqrt{log_2(\varepsilon
*SL)}) (with complications arising because SL is not known in advance, and with
sizeable constants hidden by the Big-Theta-notation), see Equation 16 of
https://arxiv.org/pdf/2004.01668.pdf. So in principle, once you tell me SL, I
can probably derive the value of \varepsilon corresponding to any value of K,
but this is complicated in practice since the expression defining K in terms of
\varepsilon and SL is complicated.
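As a rough illustration of that inversion (dropping the sizeable hidden
constants entirely, so this is only the shape of the Big-Theta relation, not
the paper's exact Equation 16), one can recover \varepsilon from K and SL
numerically:

```java
public class EpsFromK {
    // Assume K ~ (1/eps) / sqrt(log2(eps * SL)), constants dropped.
    static double kOf(double eps, double sl) {
        return (1.0 / eps) / Math.sqrt(Math.log(eps * sl) / Math.log(2.0));
    }

    // kOf is strictly decreasing in eps, so bisection (done in log space,
    // keeping log2(eps * SL) positive) recovers eps from K.
    static double epsFromK(double k, double sl) {
        double lo = 2.0 / sl, hi = 0.5;
        for (int i = 0; i < 100; i++) {
            double mid = Math.sqrt(lo * hi);
            if (kOf(mid, sl) > k) lo = mid; else hi = mid;
        }
        return Math.sqrt(lo * hi);
    }

    public static void main(String[] args) {
        double sl = 1L << 17;
        double k = kOf(0.01, sl);              // ~31.1 when eps = 0.01
        System.out.println(epsFromK(k, sl));   // recovers eps ~ 0.01
    }
}
```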
This is not any sort of criticism of the information that Lee has provided :-).
It is just an inherently annoying thing to relate K to the \varepsilon
parameter appearing in the theoretical analysis of the algorithm.
On Mon, Sep 14, 2020 at 3:19 AM Jon Malkin <[email protected]> wrote:
If you want to also test the whole range, that's fine, but 64 points over the
entire range means we haven't actually confirmed that it's working properly as
you approach the tail. It's not sampled well enough at the tails to observe
that behavior.
Ok, first off, this is what the original quantiles sketch does to handle rank
estimates going above 1.0 or below 0.0:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
In other words, we just clip the result to stay within the known range,
because the raw formula doesn't care about such limits and will happily return
something outside that range otherwise.
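In isolation the clip is just the following (a trivial restatement of the
linked lines, not the library code verbatim):

```java
public class RankClip {
    // Clamp a raw rank estimate back into the valid [0, 1] range.
    static double clip(double rawRank) {
        return Math.max(0.0, Math.min(1.0, rawRank));
    }

    public static void main(String[] args) {
        System.out.println(clip(1.3));   // 1.0
        System.out.println(clip(-0.2));  // 0.0
        System.out.println(clip(0.42));  // 0.42
    }
}
```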
And our existing bounds on the classic quantiles sketch are too pessimistic
most of the time. I mean, take this simple code example:
public void rankErrorTest() {
  final Random rand = new Random();
  final int n = 1 << 20;
  final int k = 1 << 5;
  final UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
  for (int i = 0; i < n - 1; i++) {
    sketch.update(rand.nextGaussian());
  }
  System.err.println("Retained: " + sketch.getRetainedItems()
      + "\tNormRankErr: " + sketch.getNormalizedRankError(false));
  sketch.update(rand.nextGaussian());
  System.err.println("Retained: " + sketch.getRetainedItems()
      + "\tNormRankErr: " + sketch.getNormalizedRankError(false));
}

I insert n-1 items, then print the retained-item count and the normalized rank
error. Then I insert one more item and do the same. The output is:

Retained: 511	NormRankErr: 0.05415609535374165
Retained: 32	NormRankErr: 0.05415609535374165
And that makes sense, since our rank error function doesn't take n into
account at all. As a result, we have no better an error estimate when retaining
511 items than when retaining 32. The error bound is clearly overly
conservative, not taking advantage of how many items we're actually retaining.
We see this in our documentation (it's the 99th-percentile plot):
http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png
With the error guarantee in the paper of |\hat{R}(y) - R(y)| <= \varepsilon
R(y), the linear bound seems correct. In fact, anything with error increasing
exponentially seems suspicious. But perhaps I didn't think through the details
of the rank definition enough.
Anyway, maybe I'm just confused by what exactly you mean by "predict the actual
error behavior"?
jon
On Sun, Sep 13, 2020, 10:02 PM leerho <[email protected]> wrote:
I disagree. The bounds on our current sketches do a pretty good job of
predicting the actual error behavior. That is why we have gone to the trouble
of allowing the user to choose the size of the confidence interval, and in the
case of the Theta sketches we actually track the growth of the error contour in
the early stages of estimation mode. The better we can specify the error, the
more confident the user will be in the result.
For these early characterization tests it is critical to understand the error
behavior over the entire range of ranks. How we ultimately specify and document
the proper use of this sketch is something we will do later.
Lee.
On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <[email protected]> wrote:
Our existing bounds, especially for the original quantiles sketch, don't
necessarily predict the actual error behavior. They're an upper bound on it.
Especially right before one of the major cascading compaction rounds, our error
may be substantially better than the bound.
I also feel like characterization plots going from 0.0 to 1.0 kind of miss the
point of the relative-error quantiles sketch. It's useful to know what it's
doing everywhere, but if you care about that whole range you should be using
KLL, not this sketch. If you care about accuracy over the whole range and also
high precision at the tail, you should be using both sketches together.
The interesting thing with this sketch would be to focus only on the error in,
say, the outer 5% at most. Maybe even a log-scale plot approaching the edge of
the range.
jon
--
From my cell phone.