Folks,
Here are some of the first characterization results.
Test Design:
* Stream Lengths (SL): 2^11, 2^14, 2^17, 2^20
* Stream values: natural numbers from 1 to SL, expressed as 32-bit floats.
* X-axis Trial Points: 64 evenly spaced values between 0 and SL, where the
top value is always SL. (tp[])
* Generate the true ranks of the 64 values: (tr[])
* Num Trials: 2^12 = 4096.
* Run the 4096 Trials; For each Trial:
o Shuffle the stream
o Feed the stream into the ReqSketch (sk)
o perform a bulk query: r[] = sk.getRanks(tp[]);
o for each rank compute the relative error: re[i] = r[i] / tr[i] - 1;
o feed the re[] into an array of 64 error quantiles sketches (KLL or the
older Quantiles sketch): Qerr[].
* End of Trials
* From each of the 64 Qerr[], do getQuantiles(G[]), where G[] is an array of
6 Gaussian ranks corresponding to +/-1 SD, +/-2 SD, the median, and 1.0. These
values are plotted for all 64 tr[] values along the x-axis.
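For anyone who wants to reproduce the shape of this test, here is a rough Python sketch of the trial loop. It is not the actual characterization code: a hypothetical noisy exact-rank function stands in for the ReqSketch query, the trial count is reduced, and plain sorted lists stand in for the KLL error-quantile sketches.

```python
import bisect
import random

SL = 2**11                     # stream length (the tests use 2^11 .. 2^20)
NUM_TP = 64                    # trial points along the x-axis
NUM_TRIALS = 16                # the tests use 2^12; reduced here for speed

# 64 evenly spaced trial points in (0, SL], where the top value is always SL
tp = [(i + 1) * SL // NUM_TP for i in range(NUM_TP)]

# true ranks: with values 1..SL and a "<=" criterion, the rank of v is v itself
tr = [float(v) for v in tp]

def sketch_ranks(stream, points):
    # Stand-in for sk.getRanks(tp[]): exact ranks perturbed by small Gaussian
    # noise, mimicking an approximate quantiles sketch (hypothetical).
    s = sorted(stream)
    return [bisect.bisect_right(s, p) * (1.0 + random.gauss(0.0, 0.01))
            for p in points]

err_q = [[] for _ in range(NUM_TP)]    # stand-ins for the 64 Qerr[] sketches
stream = list(range(1, SL + 1))
for _ in range(NUM_TRIALS):
    random.shuffle(stream)             # shuffle the stream
    r = sketch_ranks(stream, tp)       # bulk query: r[] = sk.getRanks(tp[])
    for i in range(NUM_TP):
        err_q[i].append(r[i] / tr[i] - 1.0)   # relative error re[i]

# From each Qerr[i], extract quantiles at the Gaussian ranks for
# -2SD, -1SD, median, +1SD, +2SD, and 1.0 -- one error contour per rank
G = [0.02275, 0.15866, 0.5, 0.84134, 0.97725, 1.0]
contours = [[sorted(e)[min(int(g * len(e)), len(e) - 1)] for g in G]
            for e in err_q]
```

Plotting each of the six contour positions across the 64 trial points gives the kind of plot shown below.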
First I configured the ReqSketch with High Rank Accuracy, the ">=" comparison
criterion, and K=50. It produced the following plot:
ReqErrorK50SL11HRA.png
This is actually what I expected! A very smooth transition from the high error
of the low ranks to the excellent error of the high ranks. For the very low
ranks the error shoots off, ultimately to infinity, as those values may have
disappeared from the sketch.
The 3 curves above zero represent +1 sd, +2 sd, and 1.0. The 2 curves below
are -1 sd and -2 sd.
What is nice is that +/-2 sd represents a 95.4% CI and that starts at ranks
0.2 and above. Running this same test with 2^20 values produces the next
plot, which behaves very much the same way:
ReqErrorK50SL20HRA.png
Next I ran the tests with Low Rank Accuracy; everything else is the same.
Note that the vertical scale is different!
This at first was a big surprise:
ReqErrorK50SL11LRA.png
But after thinking about it I realized that this is due to the way we have
defined rank as <= (or <). If we define rank as >= (or >), we should get a
similar shape to the earlier plots. Running the LRA test with SL=2^17 (I
didn't want to wait an hour for the 2^20 plot!) generated the following:
ReqErrorK50SL17LRA.png
Note that the error doesn't drop to zero except for the +/-1 sd contours. This
is because, with SL=2^17, the first of the 64 evenly spaced points is larger
than the number of values that are never promoted.
I'm not sure what was intended, but if these results are correct, this
behavior for the LRA is not very desirable. The confidence intervals are
almost flat (actually getting smaller at the high ranks, which is
counterintuitive), and the only region that is truly "highly accurate" is a
tiny range at the very low end, with a very sudden drop in accuracy at the
edge of that range. Yes, it is true that the error is better than 1% over the
entire range, but I think in practice this behavior might be problematic.
What I wonder is whether we should define rank for the LRA case as ">=" (or
">"). Either way, we would need to define our error behavior very carefully!
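A tiny illustration of why the criterion matters at the extremes (assuming rank is the normalized count of items satisfying the comparison; the data and numbers here are made up):

```python
# rank_lt counts items strictly below q; rank_ge counts items at or above q.
data = list(range(1, 101))  # 1..100

def rank_lt(q):               # "<" criterion: mass below q
    return sum(1 for x in data if x < q) / len(data)

def rank_ge(q):               # ">=" criterion: mass at or above q
    return sum(1 for x in data if x >= q) / len(data)

# A fixed absolute error of one miscounted item is a huge *relative* error
# where the counted mass is small, and negligible where it is large:
q = 2
print(rank_lt(q))   # 0.01 -> one miscounted item doubles or zeroes this rank
print(rank_ge(q))   # 0.99 -> the same miscount is only ~1% relative error
```

So which end of the rank axis carries the large relative error flips with the direction of the comparison, which is exactly the LRA/HRA asymmetry in the plots.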
Cheers,
Lee.
On Tue, Sep 8, 2020 at 5:19 PM leerho <[email protected]> wrote:
I am making good progress on the new Relative Error Quantiles
<https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles>
sketch. This is in the "RelativeErrorQuantiles" branch, not "master". It
is not ready to be checked in yet, but it is getting close. It would be
wonderful if someone could give it a whirl and/or take a look at the code.
:)
Here is what is implemented so far:
* At sketch construction, it can be configured to prioritize accuracy
for either the high ranks or the low ranks.
* The default item comparison criterion is "<", which is consistent with
our other quantile sketches. However, the user can change the
criterion to "<=" by simply calling setLtEq(true). This will allow
comparison with other quantile functions that may use that criterion.
* I implemented extensive debug and "pretty printing" functions, which
not only are useful for debugging, but should help anyone visually see
what is going on internally. (If anyone is interested in seeing this
in action, let me know, and I'll send out some instructions :) ).
I developed a custom "signaling" interface to minimize clutter in the
main code yet allow for deep analysis of what the code is doing.
* I custom crafted two versions of an O(n) merge sort depending on the
setting of highRankAccuracy (true/false). Merge sorting is used
wherever possible.
* I also custom crafted a binary search algorithm to automatically
accommodate the two comparison criteria.
* I think most of the javadocs are complete, but I'm sure there are
always ways to improve the documentation.
* All the license headers are in place
* Unit test coverage is hovering around 98% (according to my Clover).
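As a loose illustration of the binary search idea (not the actual Java code), a search that accommodates both comparison criteria can be reduced to choosing between the two standard bisection variants; the function name below is hypothetical:

```python
import bisect

def absolute_rank(sorted_vals, q, lt_eq=False):
    # With the "<" criterion, the rank is the count of items strictly below q;
    # with "<=" (after setLtEq(true)), items equal to q are also counted.
    if lt_eq:
        return bisect.bisect_right(sorted_vals, q)   # items <= q
    return bisect.bisect_left(sorted_vals, q)        # items < q

vals = [10, 20, 20, 30]
print(absolute_rank(vals, 20))              # 1  (items < 20)
print(absolute_rank(vals, 20, lt_eq=True))  # 3  (items <= 20)
```

The two variants differ only on ties, which is where the choice of criterion matters.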
What remains to be completed:
* The design of the binary storage format
* Serialization / Deserialization.
* Comprehensive characterization of accuracy, speed, and stored size as
functions of k and N.
* I could really use some help with code reviews and feedback! :) :)
If anyone needs some help in getting it booted up so you can play with it,
let me know. Sorry, this is not Python, it does take a few more steps :)
Also, if any of you are not signed up for our dev@ list, just send a
blank email to [email protected]. The traffic is really low,
so it won't swamp your inbox!
Cheers,
Lee.
On Mon, Sep 7, 2020 at 5:57 AM Pavel Vesely <[email protected]> wrote:
Hi Lee,
okay, let's talk on Thursday. I'll look at your code by then.
Cheers,
Pavel
On 07.09.2020 at 3:40 leerho wrote:
> Yes, let's meet on Thursday 9:00AM PDT (UTC -7).
>
> 1. I am uncomfortable with changing the getMaximumRSE(k) method by a factor
> of 8 (you suggested 1 or .9) without some rigorous testing or a proof. (We
> would lose our ability to say that our bounds are mathematically proven :) )
> We should discuss this.
>
> 2. Just as a comment, we always return ranks as normalized ranks [0,1], not
> as counts. Not a big deal, but when you look at my code you will need to
> know this :)
>
> Please take a look at my code at:
> https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles
>
> Cheers,
>
> Lee.
>
>
>
> On Thu, Sep 3, 2020 at 7:40 AM Vesely, Pavel <[email protected]> wrote:
>
> Hi Lee,
>
> I finally got back to you. Regarding your summary below, it's right that
> we don't have a mathematical analysis supporting Strategy 4, and there
> might be none (although constructing a specific example might be tricky).
>
> I've added the remaining functions that you need, namely:
>
> - getMaximumRSE(k): Returns an a priori estimate of the relative standard
> error (RSE, expressed as a number in [0,1]), calculated as sqrt(Var) /
> rank (note that neither n nor rank is needed). I'm using a formula for
> Var from the paper, but it seems too pessimistic (by a factor of 3),
> according to some experiments that I've done.
>
> - getUpperBound(numStdDev) and getLowerBound(numStdDev), which provide a
> confidence interval by adding/subtracting a specified multiple of stdDev
> from the estimate.
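A hypothetical sketch of bounds of this shape (the function names echo the description above, and the RSE value is invented; this is not Pavel's actual code):

```python
# Confidence bounds built from a relative standard error: the bound
# adds/subtracts numStdDev multiples of (RSE * estimate) from the estimate.
def upper_bound(estimate, rse, num_std_dev):
    return estimate * (1.0 + num_std_dev * rse)

def lower_bound(estimate, rse, num_std_dev):
    return estimate * (1.0 - num_std_dev * rse)

est, rse = 1000.0, 0.01       # made-up estimate and RSE
print(lower_bound(est, rse, 2), upper_bound(est, rse, 2))
```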
>
> Let me know if you have any questions. I can also join a call next week
> (Wednesday or Thursday would be best).
>
> Cheers,
> Pavel
>
>
> On 26.08.2020 at 21:18 leerho wrote:
>> Hi Pavel,
>> Thanks for your analysis. My summary of your analysis:
>>
>> * Strategy 1: This would work but performs too many compactions. It has
>> the simplest prediction of sketch size. The math obviously works, so
>> a priori prediction of sketch error should be relatively straightforward.
>> * Strategy 2: The math works and has far fewer compactions than
>> Strategy 1. You didn't comment on the difficulty of a priori
>> prediction of sketch size. Because it is anticipated in the paper,
>> the a priori prediction of sketch error should also be straightforward.
>> * Strategy 3: Comment: your restatement of Region 1 is equivalent to
>> what I called Region 1 and Region 2. I only separated out Region 2 to
>> emphasize that that is the size of the overflow. Major
>> disadvantages are: the math analysis doesn't support this strategy,
>> it would have as many compactions as Strategy 1, and it may be
>> vulnerable to pathological sequences. So Strategy 3 is definitely out!
>> * Strategy 4: (If I understood your comments...) the math does not
>> support this strategy. So let's leave this one out as well.
>>
>> That leaves only Strategy 2, which I will modify the Java code to
>> implement. This certainly simplifies testing, which I am all in favor of!
>>
>> Lee.
>>
>>
>> On Wed, Aug 26, 2020 at 9:22 AM Pavel Vesely <[email protected]> wrote:
>>
>> Hi Lee!
>>
>> Thanks for sending the links. I've done some renaming according to your
>> suggestions and added two new functions to my code:
>>
>> - quantile(q): given quantile query q in [0,1], it returns an approx.
>> q-quantile
>>
>> - getMaxStoredItems(k, n): this is a static method that estimates the
>> (max. nominal) size of the sketch with parameter k on input of size n. It
>> seems to be correct up to a 10% error (at least, when the input size and
>> k are sufficiently large, such as k >= 16). I believe a more precise
>> function would need to simulate constructing the whole sketch ...
>>
>> Regarding the strategies, I have a few comments (some of them I said
>> during the call):
>>
>> *Strategy 1:* Indeed, this strategy performs unnecessarily many compaction
>> operations. The only advantage is the easy predictability of the final
>> size of the sketch, i.e., possibly a more precise getMaxStoredItems
>> function.
>>
>> *Strategy 2:* Its advantage is that our mathematical analysis works for
>> sure (strictly speaking, the proof works for Strategy 1, but there are no
>> issues introduced by the laziness of Strategy 2, and actually, I think the
>> more general analysis of Appendix D applies to Strategy 2 directly).
>>
>> *Strategy 3:* I remember it slightly differently:
>>
>> - Region 1 has Length - numSections * sectionSize items and is never
>> compacted; it possibly contains some part of the "overflow" part (above
>> nomCap)
>> - Region 2 has the numSections * sectionSize largest items, split into
>> sections of size sectionSize, and we choose some number of sections to
>> compact
>>
>> This has the disadvantage of compacting just a few items when sectionSize
>> is small, leading to a larger number of compactions (similarly as for
>> Strategy 1). Furthermore, our mathematical analysis does not go through
>> directly, and there may be some (contrived) adversarial instances on which
>> it has a large error.
>>
>> The advantage is that we protect (i.e., do not compact) more items than in
>> the previous strategies when the buffer overflow is large.
>>
>> *Strategy 4:* I think Region 1 has size Length / 2, instead of nomCap / 2.
>> The advantage is that when the buffer overflow is large, we compact more
>> than in Strategy 3 (although not as many items as in Strategy 2), while we
>> also protect more items than in Strategy 2.
>>
>> The disadvantage is that the getMaxStoredItems function would be least
>> precise, because sectionSize varies over time (at least, that's my
>> intuition). As for Strategy 3, our mathematical analysis does not go
>> through directly, and there may be some adversarial instances on which it
>> has a large error, perhaps even more contrived than in the previous case.
>>
>> So far, I have no clear recommendation, and I haven't thought much about
>> calculating the confidence bounds (will get to it next week). I would only
>> exclude Strategy 1 and Strategy 3.
>>
>> Best,
>> Pavel
>>
>> On 25.08.2020 at 23:40 leerho wrote:
>> > I forgot to add the links:
>> >
>> > Main Code:
>> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/main/java/org/apache/datasketches/req
>> >
>> > Test Code:
>> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/test/java/org/apache/datasketches/req
>> >
>> > On Tue, Aug 25, 2020 at 2:10 PM leerho <[email protected]> wrote:
>> >
>> > Folks,
>> >
>> > Here are the links to the Java main code and the Java test code for the
>> > new REQ sketch. This code is in a very initial state and is in the branch
>> > "RelativeErrorQuantiles". It has a lot of work left before we merge it
>> > into master, so recognize it will be changing a lot.
>> >
>> > Pavel and I had a long discussion about what the best (& most practical)
>> > compaction strategy would be. Whatever strategy we choose, we need to be
>> > confident that we have mathematics that will support the implementation
>> > and that we will be able to obtain reasonable confidence bounds on the
>> > estimates produced by the sketch. All of these strategies assume that the
>> > low ranks are the most accurate. (This of course can be changed.)
>> >
>> > *Strategy 1.* My understanding of the paper is that it proposes this
>> > strategy:
>> > The compactor has a maximum size given by Capacity = 2 * numSections *
>> > sectionSize. As soon as the compactor reaches capacity, the compaction
>> > process starts.
>> >
>> > * Region1: The bottom half, of size Capacity / 2, which is never
>> > compacted.
>> > * Region2: The top half consists of m sections of size k (these can
>> > change as a f(numCompactions), but ignore that for this description).
>> > o The top-most of the m sections is always compacted, leaving
>> > m-1 sections to be potentially compacted.
>> > o Of the m-1 sections, choose the top m' of these based on the
>> > distribution created from the number of trailing ones in the
>> > number of compactions that have already occurred in this
>> > compactor.
>> >
>> > This will create the largest number of compactions and the most levels
>> > per value of N.
>> >
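A guess at what the trailing-ones rule above might look like in code (an illustration only, not the actual REQ implementation; the function name and the exact schedule are assumptions):

```python
def num_sections_to_compact(num_compactions, num_sections):
    # Count trailing ones in the binary representation of the number of
    # compactions already performed by this compactor.
    trailing_ones = 0
    c = num_compactions
    while c & 1:
        trailing_ones += 1
        c >>= 1
    # Always compact at least the top-most section; never more than all of
    # them. Deeper sections are thus compacted geometrically less often.
    return min(1 + trailing_ones, num_sections)

# The schedule over successive compactions, for m = 4 sections:
print([num_sections_to_compact(i, 4) for i in range(8)])
# -> [1, 2, 1, 3, 1, 2, 1, 4]
```

The geometric thinning is what lets low sections (low ranks, under this ordering) survive far longer than high ones.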
>> > *Strategy 2.* This strategy is what Pavel suggested in his Python code:
>> > The compactor has a Nominal Capacity and an actual Length that can exceed
>> > the nomCap. The nomCap is defined as above, where nomCap = 2 *
>> > numSections * sectionSize for each compactor. The sketch has a maxSize =
>> > sum(compactor[i].nomCap()). When this maxSize is exceeded, the
>> > compression process starts, which chooses the lowest numbered compactor
>> > that exceeds its nomCap to compact. If this turns out to be the top
>> > compactor, then a new compactor is added at the top.
>> >
>> > By allowing a compactor to exceed its nomCap, we keep more values at a
>> > lower level, improving accuracy. But the issue is what to do with the
>> > "overflow" items during compaction. This is addressed differently by this
>> > strategy and the next two.
>> >
>> > * This compactor has 3 regions:
>> > o Region1: of size nomCap / 2; is never compacted.
>> > o Region2: of size numSections * sectionSize.
>> > o Region3: of size Length - (Region1 + Region2); this is the
>> > "overflow" and is always compacted.
>> > * In this strategy, all of Region3 plus the randomly chosen top m'
>> > sections of Region2 will be compacted.
>> >
>> >
>> > *Strategy 3.* This strategy is what I have currently implemented in Java
>> > (and in my version of Pavel's Python) (because I didn't fully understand
>> > what Pavel had in mind :) ):
>> > This strategy is similar to Strategy 2, and also allows for overflow. The
>> > way the regions are treated is different:
>> >
>> > * This compactor also has 3 regions:
>> > o Region 1, of size nomCap / 2, is never compacted.
>> > o Region 2 is the overflow region of size Length - nomCap. It is
>> > also not compacted.
>> > o Region 3, of size numSections * sectionSize. The random m'
>> > sections will be compacted based on the f(numCompactions) as
>> > above.
>> >
>> > This keeps more values behind and would maintain better accuracy longer,
>> > but perhaps causes more compactions than #2.
>> >
>> > *Strategy 4.* This strategy is similar to #2 and #3, but again different
>> > in how the overflow values are treated:
>> >
>> > * This compactor has 2 regions:
>> > o Region 1, of size nomCap / 2, is never compacted.
>> > o Region 2, of size numSections * sectionSize', where
>> > sectionSize' is dynamically adjusted so that numSections *
>> > sectionSize' consumes most of Length - Region1. The number m'
>> > is chosen as before to determine which sections to compact.
>> >
>> > All of these strategies are implementable and have different tradeoffs
>> > with respect to accuracy, size, and speed.
>> >
>> > *Questions:*
>> >
>> > 1. Which of these strategies can provide:
>> >
>> > * An a priori calculation of the sketch size as a f(k, N).
>> > * An a priori calculation of estimation error with confidence
>> > intervals.
>> > * An a posteriori calculation of confidence intervals based on the
>> > state of the sketch.
>> > * A defense that claims that the sketch implementation is backed by
>> > mathematical proofs of correct operation with provable error
>> > properties.
>> >
>> > 2. What is your recommendation?
>> >
>> > Note: I may implement all 4 strategies so that we can fully characterize
>> > them, but I would prefer if we could limit these options to one (or two)
>> > to put into production. :)
>> >
>> > Your help with these questions would be appreciated.
>> >
>> > Lee.
>> >
>>