Our existing bounds, especially for the original quantiles sketch, don't
necessarily predict the actual error behavior; they're an upper bound on
it. In particular, right before one of the major cascading compaction
rounds, our actual error may be substantially better than the bound.

I also feel like characterization plots going from 0.0 to 1.0 kind of miss
the point of the relative error quantiles sketch. It's somewhat useful to
know what it's doing everywhere, but if you care about that whole range you
should be using KLL, not this sketch. If you care about accuracy over the
whole range and also high precision at the tail, you should be using both
sketches together. The interesting thing with this sketch would be to focus
only on the error in, say, the outer 5% at most, maybe even with a log-scale
plot approaching the edge of the range.
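To make that concrete, one way to build such a plot is to log-space the
distance from the upper edge of the normalized rank range. This is just an
illustrative sketch, not part of any sketch API; the function name, the
point count, and the 5% cutoff are all assumptions:

```python
import math

def tail_eval_ranks(num_points=50, max_dist=0.05, min_dist=1e-6):
    # Log-spaced distances from rank 1.0, so the evaluation points
    # concentrate toward the upper (HRA) edge of the rank range.
    # max_dist=0.05 restricts the plot to the outer 5% of ranks.
    log_hi, log_lo = math.log10(max_dist), math.log10(min_dist)
    step = (log_lo - log_hi) / (num_points - 1)
    return [1.0 - 10.0 ** (log_hi + i * step) for i in range(num_points)]

# Every point lies in the outer 5%, approaching but never reaching 1.0.
ranks = tail_eval_ranks()
```

For the LRA case, the mirrored points (i.e. `1.0 - r`) would probe the
lower tail instead.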

  jon

On Sun, Sep 13, 2020 at 1:10 PM leerho <[email protected]> wrote:

> Pavel,
> Yes, I was doing something wrong! Sorry! I was confusing the predicted
> bounds on a rank with the error shown in the graphs, where the absolute
> rank is subtracted out, showing only the difference.
>
> Nonetheless, we do need functions that do not go negative for small K.
>
> Lee.
>
> On Sun, Sep 13, 2020 at 12:50 PM leerho <[email protected]> wrote:
>
>> Pavel,
>> I have looked very carefully at your p-code for getRankUB and getRankLB,
>> and they both produce nonsensical values.
>>
>>    - getMaximumRSE(k) with K=4 (the smallest value of K), multiplied
>>    by 3 SD, yields 1.23, which is > 1.  It should never produce values > 1
>>    for any legal value of K, because that causes the LB to go negative.
>>    - Given a K and SD, the factor (1 +/- SD * RSE) is a constant.
>>    Multiplied by the rank, it produces linearly increasing bounds from
>>    the lowest rank to the highest.  This does not model the error
>>    properties of the HRA case, which has decreasing error at the high
>>    ranks, or the LRA case, where the error is fairly flat over most of
>>    the range.
>>    - We need equations that not only predict the actual error behavior,
>>    but also never produce lower-bound values < 0 for any legal value of
>>    K.  The current equations do go negative.
>>
>> Comments:
>>
>>    - This is not a bug, but a user design principle.  I don't think it
>>    is a good idea to have the user specify a value to obtain the rank error.
>>    We don't want to give the user the idea that the error bounds are related
>>    to values at all, only to ranks.  I realize we can translate the given
>>    value to an estimated rank, but let's have the user do that and document
>>    the caveats of doing so.
>>    - We have been training our users to become accustomed to specifying
>>    and obtaining ranks as normalized [0,1]. That is how our other quantile
>>    sketches work.  Most users think in terms of percentiles anyway, so
>>    normalized ranks are pretty natural.
>>
>> Perhaps I am still doing something wrong?
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, Sep 12, 2020 at 11:26 AM leerho <[email protected]> wrote:
>>
>>> Hi Pavel,
>>> I think we are in agreement.  I take it as a positive that you haven't
>>> found any flaw in the implementation of the relative error quantiles
>>> algorithm.   What we are discussing now is how to define rank,
>>> especially with respect to HRA and LRA, and, philosophically, what kind
>>> of error distribution as a function of rank users will want and find
>>> easiest to specify and explain.
>>>
>>> My intuition is that all plots should look roughly the same, up to
>>>> mirroring
>>>> along the Y axis.
>>>
>>>
>>> This statement is a little puzzling, as it will only be true if we choose
>>> the definition of rank appropriately depending on whether the user selects
>>> LRA or HRA.  As my plots reveal, if we keep the definition of rank the
>>> same, switching from HRA to LRA produces a dramatically different error
>>> distribution as a function of rank.  I agree with your formulas for
>>> relative error, except that all of our ranks are already normalized by SL,
>>> so I would replace SL in your formulas with 1.0.
>>>
>>> I still want to add to these plots your a priori calculation of error to
>>> see how it compares to the measured error.
>>>
>>> I gather from your comments that what you had in mind when writing the
>>> paper was a rank error distribution that looks like plots 3 and 4 above and
>>> not plots 1 and 2.  However, I feel that the rank error distribution shown
>>> in plots 1 and 2 would be easier for users to understand.  We will
>>> probably leave the decision up to the user as to which they prefer;
>>> however, this will require more documentation to explain all the
>>> different error behaviors :(.
>>>
>>> Lee.