This is all good feedback.  Thank you all.  The plots that I have sent out
were just my first attempt at characterization, they are by no means cast
in stone!  I will definitely try Pavel's redefinition of error as well as
increasing the density of plot points in the tail.

Lee



On Mon, Sep 14, 2020 at 7:55 AM Vesely, Pavel <[email protected]>
wrote:

> Hi Justin,
> Regarding K and epsilon: K is actually the initial section size,
> and it should be equal to Theta(1 / epsilon), regardless of the stream
> length SL. This is because SL is not known in advance and we maintain the
> correct section size dynamically. That is, each compactor starts assuming
> that SL is small (i.e. SL = O(K)), so it has just 3 sections of size K, and
> every time the number of sections needs to be increased because (# of
> compactions) > 2^(# of sections), we decrease K by a factor of sqrt(2) for
> that compactor and double the number of sections.
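> A rough toy model of the schedule just described (my own names and
> rounding choices; not the actual ReqSketch code):

```java
// Toy model of the section-size schedule described above: each
// compactor starts with 3 sections of size K, and whenever
// (# of compactions) > 2^(# of sections), K shrinks by sqrt(2)
// and the section count doubles. Names and rounding are mine.
public class SectionSchedule {
    int sectionSize;      // current K for this compactor
    int numSections = 3;  // each compactor starts with 3 sections
    long numCompactions = 0;

    SectionSchedule(int initialK) { sectionSize = initialK; }

    // Called once per compaction of this compactor.
    void recordCompaction() {
        numCompactions++;
        if (numCompactions > (1L << numSections)) {
            sectionSize = (int) Math.max(4, Math.round(sectionSize / Math.sqrt(2)));
            numSections *= 2;
        }
    }

    public static void main(String[] args) {
        SectionSchedule s = new SectionSchedule(64);
        for (int i = 0; i < 100; i++) s.recordCompaction();
        // With K=64: first adjustment after the 8th compaction, second after the 64th.
        System.out.println("K=" + s.sectionSize + " sections=" + s.numSections);
    }
}
```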
>
> So, it seems to me that in the implementation, the relation between K and
> \varepsilon is quite clear, up to a constant factor :-) The relation of K
> and epsilon is only complicated in the paper, where we involve SL in the
> definition of K.
>
> Pavel
>
> PS: I agree that we should have more samples of the error distribution
> towards the tail that we care about
>
> On 2020-09-14 14:32, Justin Thaler wrote:
>
> > that's fine, but 64 points over the entire range means we haven't
> > actually confirmed that it's working properly as you approach the tail.
> > It's not sampled well enough at the tails to observe that behavior.
>
> I'll second that the primary motivation behind studying relative-error
> quantiles is to get really good accuracy in the tail. So while we should,
> of course, carefully characterize the error behavior over the whole range
> of ranks, a key component of the characterization should be to understand
> in practice how the error is behaving in the tail. This may require
> significantly more fine-grained characterization testing in the tail than
> we are used to.
>
> In particular, a lot of the ongoing discussion may just stem from the fact
> that the quantiles being reported are just the integer multiples of 1/64.
> My initial reaction is that the benefit of the relative-error sketch will
> be more apparent if the characterization tests cut up the interval [0,
> 1/64] much more finely.
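> For instance, one could sample the tail geometrically, so the ranks
> nearest 0 get the densest coverage (illustrative only; names are mine):

```java
// One way to cut the tail interval [0, 1/64] more finely for
// characterization: geometrically spaced normalized ranks, halving
// toward 0, so the extreme tail gets the densest sampling.
public class TailRanks {
    static double[] tailRanks(double tail, int points) {
        double[] r = new double[points];
        for (int i = 0; i < points; i++) {
            // ranks tail, tail/2, tail/4, ... approaching 0
            r[i] = tail / Math.pow(2, i);
        }
        return r;
    }

    public static void main(String[] args) {
        for (double r : tailRanks(1.0 / 64, 8)) System.out.println(r);
    }
}
```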
>
> For example, the theory guarantees that for Low-Rank-Accuracy, items of
> (absolute) rank 0, 1, ..., 1/\varepsilon have exact counts, items of rank
> (1/\varepsilon)+1, (1/\varepsilon)+2, ..., 2/\varepsilon have absolute error
> at most 1, and so on.
>
> In terms of relative ranks (normalized by stream length SL), each interval
> of 1/\varepsilon items referenced above has length just 1/(\varepsilon SL),
> i.e., we only guarantee exact counts for items of relative rank [0,
> 1/(\varepsilon SL)].
>
> If \varepsilon is, say, 0.01, and SL is 2^{17}, then we are talking about
> intervals of length less than 0.001. This means that I am not surprised to
> see that we do not get _exact_ counts out of the sketch in practice even in
> the tail if we are effectively defining the tail to be all items of
> relative rank [0, 1/64]~=[0, 0.015].
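> To make that arithmetic concrete (same numbers as above; the helper
> name is mine):

```java
// The exact-count region in normalized-rank terms: 1/epsilon items
// out of SL total gives relative length (1/eps)/SL = 1/(eps*SL).
public class ExactRegion {
    static double exactRegionLength(double eps, long sl) {
        return 1.0 / (eps * sl);
    }

    public static void main(String[] args) {
        // eps = 0.01, SL = 2^17 as in the example above
        double len = exactRegionLength(0.01, 1L << 17);
        System.out.println(len);  // ~7.6e-4, well under 1/64 ~= 0.0156
    }
}
```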
>
> It's a bit tricky to try to figure out what setting of \varepsilon is
> (implicitly) being used in the current characterization code since the
> parameter provided is K as opposed to \varepsilon.
>
> I think K here means each buffer of size ~2B is viewed by the algorithm as
> consisting of 2B/K blocks of length K. In which case, if I recall
> correctly, the paper wants to set K to
> Theta((1/\varepsilon)/\sqrt{log_2(\varepsilon * SL)}) (with complications
> arising because SL is not known in advance, and with sizeable constants
> hidden by the Big-Theta-notation), see Equation 16 of
> https://arxiv.org/pdf/2004.01668.pdf. So in principle, once you tell me
> SL, I can probably derive the value of \varepsilon corresponding to any
> value of K, but this is complicated in practice since the expression
> defining K in terms of \varepsilon and SL is complicated.
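> That said, if one takes the quoted expression at face value (with the
> Theta-constants dropped), epsilon can be backed out from K numerically,
> e.g. by bisection. Purely illustrative:

```java
// Back out epsilon from K using the expression quoted above,
// K ~ (1/eps) / sqrt(log2(eps * SL)), with the Theta-constants
// dropped. Illustrative only; the paper's Equation 16 carries
// constants this ignores.
public class EpsFromK {
    static double kOf(double eps, long sl) {
        return (1.0 / eps) / Math.sqrt(Math.log(eps * sl) / Math.log(2));
    }

    // kOf is decreasing in eps on this domain, so bisection works.
    static double epsFromK(double k, long sl) {
        double lo = 4.0 / sl, hi = 0.5;  // keep log2(eps * SL) >= 2
        for (int i = 0; i < 100; i++) {
            double mid = 0.5 * (lo + hi);
            if (kOf(mid, sl) > k) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        long sl = 1L << 17;
        double eps = epsFromK(50, sl);
        System.out.println(eps + " -> K=" + kOf(eps, sl));
    }
}
```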
>
> This is not any sort of criticism of the information that Lee has provided
> :-). It is just an inherently annoying thing to relate K to the \varepsilon
> parameter appearing in the theoretical analysis of the algorithm.
>
>
> On Mon, Sep 14, 2020 at 3:19 AM Jon Malkin <[email protected]> wrote:
>
>> If you want to also test the whole range, that's fine, but 64 points over
>> the entire range means we haven't actually confirmed that it's working
>> properly as you approach the tail. It's not sampled well enough at the
>> tails to observe that behavior.
>>
>> Ok, first off, this is what the original quantiles sketch does to handle
>> rank estimates going above 1.0/below 0.0:
>>
>> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L240
>>
>> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java#L251
>> In other words, we just clip the result to stay within the known range.
>> Because the raw formula doesn't care about such limits and will happily
>> return something outside that range otherwise.
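>> In code, the clipping amounts to something like this (a paraphrase of
>> the linked lines, not the actual source):

```java
// Paraphrase of what the linked DoublesSketch lines do: clamp the
// raw upper/lower bound formulas into [0, 1], since the formulas
// themselves don't respect those limits.
public class RankClamp {
    static double clampRank(double rawRank) {
        return Math.min(1.0, Math.max(0.0, rawRank));
    }

    public static void main(String[] args) {
        System.out.println(clampRank(1.07));   // 1.0
        System.out.println(clampRank(-0.02));  // 0.0
    }
}
```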
>>
>> And our existing bounds on the classic quantiles sketch are too
>> pessimistic most of the time. I mean take this simple code example:
>>
>> public void rankErrorTest() {
>>   int n = 1 << 20;
>>   int k = 1 << 5;
>>   UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build();
>>   for (int i = 0; i < n - 1; i++)
>>     sketch.update(rand.nextGaussian());
>>   System.err.println("Retained: " + sketch.getRetainedItems()
>>       + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
>>   sketch.update(rand.nextGaussian());
>>   System.err.println("Retained: " + sketch.getRetainedItems()
>>       + "\tNormRakErr: " + sketch.getNormalizedRankError(false));
>> }
>>
>> I insert n-1 items, then print retained items and rank error. Then I
>> insert 1 more and do the same. The output is:
>> Retained: 511 NormRakErr: 0.05415609535374165
>> Retained: 32 NormRakErr: 0.05415609535374165
>> And that makes sense since our rank error function doesn't take into
>> account n at all. As a result, we have no better an error estimate when
>> retaining 511 items compared to 32. The error bound is clearly being overly
>> conservative, not taking advantage of how many items we're actually
>> retaining. We see the same behavior in our documentation (this is the
>> 99th-percentile plot):
>> http://datasketches.apache.org/docs/img/quantiles/qds-7-compact-accuracy-1k-99pct-20180224.png
>>
>> With the error guarantee in the paper of |\hat{R}(y) - R(y)| <=
>> \varepsilon R(y), the linear bound seems correct. In fact, anything with
>> error increasing exponentially seems suspiciously odd. But perhaps I didn't
>> think through the details of the rank definition enough.
>>
>> Anyway, maybe I'm just confused by what exactly you mean by "predict the
>> actual error behavior"?
>>
>>   jon
>>
>>
>> On Sun, Sep 13, 2020, 10:02 PM leerho <[email protected]> wrote:
>>
>>> I disagree.  The bounds on our current sketches do a pretty good job of
>>> predicting the actual error behavior.  That is why we have gone to the
>>> trouble of allowing the user to choose the size of the confidence interval
>>> and in the case of the Theta sketches we actually track the growth of the
>>> error contour in the early stages of estimation mode.  The better we can
>>> specify the error, the more confident the user will be in the result.
>>>
>>> For these early characterization tests it is critical to understand the
>>> error behavior over the entire range of ranks.  How we ultimately specify
>>> and document the proper use of this sketch is something we can do later.
>>>
>>> Lee.
>>>
>>>
>>> On Sun, Sep 13, 2020 at 1:45 PM Jon Malkin <[email protected]> wrote:
>>>
>>>> Our existing bounds, especially for the original quantiles sketch,
>>>> don't necessarily predict the actual error behavior. They're an upper bound
>>>> on it. Especially right before one of the major cascading compaction
>>>> rounds, our error may be substantially better than the bound.
>>>>
>>>> I also feel like characterization plots going from 0.0-1.0 kind of
>>>> misses the point of the relative error quantiles sketch. It's kind of
>>>> useful to know what it's doing everywhere but if you care about that whole
>>>> range you should be using KLL, not this sketch. If you care about accuracy
>>>> over the whole range and also high precision at the tail, you should be
>>>> using both sketches together. The interesting thing with this sketch would
>>>> be to focus only on error in, say, the outer 5% at most. Maybe even a log
>>>> scale plot approaching the edge of the range.
>>>>
>>>>   jon
>>>>
>>> --
>>> From my cell phone.
>>>
>>
>
