Folks,
Here are some of the first characterization results.
Test Design:
* Stream Lengths (SL): 2^11, 2^14, 2^17, 2^20
* Stream values: natural numbers from 1 to SL, expressed as 32-bit floats.
* X-axis Trial Points: 64 evenly spaced values between 0 and SL, where the
top value is always SL. (tp[])
* Generate the true ranks of the 64 values: (tr[])
* Num Trials: 2^12 = 4096.
* Run the 4096 Trials; For each Trial:
o Shuffle the stream
o Feed the stream into the ReqSketch (sk)
o perform a bulk query: r[] = sk.getRanks(tp[]);
o for each rank compute the relative error: re[i] = r[i] / tr[i] - 1;
o feed the re[] into an array of 64 error quantiles sketches (KLL or the
older Quantiles sketch): Qerr[].
* End of Trials
* From each of the 64 Qerr[], do getQuantiles(G[]), where G[] is an array of
6 Gaussian ranks corresponding to +/-1 SD, +/-2 SD, the median, and 1.0. These
values are plotted for all 64 tr[] values along the x-axis.
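For anyone who wants to reproduce the shape of this test, here is a rough Python sketch of the trial loop. It is not the actual characterization code: a hypothetical noisy exact-rank function stands in for the ReqSketch query, the trial count is reduced, and plain sorted lists stand in for the KLL error-quantile sketches.

```python
import bisect
import random

SL = 2**11                     # stream length (the tests use 2^11 .. 2^20)
NUM_TP = 64                    # trial points along the x-axis
NUM_TRIALS = 16                # the tests use 2^12; reduced here for speed

# 64 evenly spaced trial points in (0, SL], where the top value is always SL
tp = [(i + 1) * SL // NUM_TP for i in range(NUM_TP)]

# true ranks: with values 1..SL and a "<=" criterion, the rank of v is v itself
tr = [float(v) for v in tp]

def sketch_ranks(stream, points):
    # Stand-in for sk.getRanks(tp[]): exact ranks perturbed by small Gaussian
    # noise, mimicking an approximate quantiles sketch (hypothetical).
    s = sorted(stream)
    return [bisect.bisect_right(s, p) * (1.0 + random.gauss(0.0, 0.01))
            for p in points]

err_q = [[] for _ in range(NUM_TP)]    # stand-ins for the 64 Qerr[] sketches
stream = list(range(1, SL + 1))
for _ in range(NUM_TRIALS):
    random.shuffle(stream)             # shuffle the stream
    r = sketch_ranks(stream, tp)       # bulk query: r[] = sk.getRanks(tp[])
    for i in range(NUM_TP):
        err_q[i].append(r[i] / tr[i] - 1.0)   # relative error re[i]

# From each Qerr[i], extract quantiles at the Gaussian ranks for
# -2SD, -1SD, median, +1SD, +2SD, and 1.0 -- one error contour per rank
G = [0.02275, 0.15866, 0.5, 0.84134, 0.97725, 1.0]
contours = [[sorted(e)[min(int(g * len(e)), len(e) - 1)] for g in G]
            for e in err_q]
```

Plotting each of the six contour positions across the 64 trial points gives the kind of plot shown below.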
First I configured the ReqSketch with High Rank Accuracy, the ">=" comparison
criterion, and K=50. It produced the following plot:
ReqErrorK50SL11HRA.png
This is actually what I expected! A very smooth transition from the high error
of the low ranks to the excellent error of the high ranks. For the very low
ranks the error shoots off, ultimately to infinity, as those values may have
disappeared from the sketch.
The 3 curves above zero represent +1 sd, +2 sd, and 1.0. The 2 curves below
are -1 sd and -2 sd.
What is nice is that +/-2 sd represents a 95.4% CI and that starts at ranks
0.2 and above. Running this same test with 2^20 values produces the next
plot, which behaves very much the same way:
ReqErrorK50SL20HRA.png
Next I ran the tests with Low Rank Accuracy; everything else is the same.
Note that the vertical scale is different!
This at first was a big surprise:
ReqErrorK50SL11LRA.png
But after thinking about it I realized that this is due to the way we have
defined rank as <= (or <). If we define rank as >= (or >), we should get a
similar shape to the earlier plots. Running the LRA test with SL=2^17 (I
didn't want to wait an hour for the 2^20 plot!) generated the following:
ReqErrorK50SL17LRA.png
Note that the error doesn't drop to zero except for the +/-1 sd contours. This
is because, with SL=2^17, the first of the 64 evenly spaced points is larger
than the number of values that are never promoted.
I'm not sure what was intended, but if these results are correct, this
behavior for the LRA is not very desirable. The confidence intervals are
almost flat (actually getting smaller at the high ranks, which is
counterintuitive), and the only region that is truly "highly accurate" is a
tiny range at the very low end, with a very sudden drop in accuracy at the
edge of that range. Yes, it is true that the error is better than 1% over the
entire range, but I think in practice this behavior might be problematic.
What I wonder is whether we should define rank for the LRA case as ">=" (or
">"). Either way, we would need to define our error behavior very carefully!
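A tiny illustration of why the criterion matters at the extremes (assuming rank is the normalized count of items satisfying the comparison; the data and numbers here are made up):

```python
# rank_lt counts items strictly below q; rank_ge counts items at or above q.
data = list(range(1, 101))  # 1..100

def rank_lt(q):               # "<" criterion: mass below q
    return sum(1 for x in data if x < q) / len(data)

def rank_ge(q):               # ">=" criterion: mass at or above q
    return sum(1 for x in data if x >= q) / len(data)

# A fixed absolute error of one miscounted item is a huge *relative* error
# where the counted mass is small, and negligible where it is large:
q = 2
print(rank_lt(q))   # 0.01 -> one miscounted item doubles or zeroes this rank
print(rank_ge(q))   # 0.99 -> the same miscount is only ~1% relative error
```

So which end of the rank axis carries the large relative error flips with the direction of the comparison, which is exactly the LRA/HRA asymmetry in the plots.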
Cheers,
Lee.
On Tue, Sep 8, 2020 at 5:19 PM leerho <[email protected]> wrote:
I am making good progress on the new Relative Error Quantiles
<https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles>
sketch. This is in the "RelativeErrorQuantiles" branch, not "master". It
is not ready to be checked in yet, but it is getting close. It would be
wonderful if someone could give it a whirl and/or take a look at the code.
:)
Here is what is implemented so far:
* At sketch construction, it can be configured to prioritize accuracy
for either the high ranks or the low ranks.
* The default item comparison criterion is "<", which is consistent with
our other quantile sketches. However, the user can change the
criterion to "<=" by simply calling setLtEq(true). This will allow
comparison with other quantile functions that may use that criterion.
* I implemented extensive debug and "pretty printing" functions, which
not only are useful for debugging, but should help anyone visually see
what is going on internally. (If anyone is interested in seeing this
in action, let me know, and I'll send out some instructions :) ).
I developed a custom "signaling" interface to minimize clutter in the
main code yet allow for deep analysis of what the code is doing.
* I custom crafted two versions of an O(n) merge sort depending on the
setting of highRankAccuracy (true/false). Merge sorting is used
wherever possible.
* I also custom crafted a binary search algorithm to automatically
accommodate the two comparison criteria.
* I think most of the javadocs are complete, but I'm sure there are
always ways to improve the documentation.
* All the license headers are in place
* Unit test coverage is hovering around 98% (according to my Clover).
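As a loose illustration of the binary search idea (not the actual Java code), a search that accommodates both comparison criteria can be reduced to choosing between the two standard bisection variants; the function name below is hypothetical:

```python
import bisect

def absolute_rank(sorted_vals, q, lt_eq=False):
    # With the "<" criterion, the rank is the count of items strictly below q;
    # with "<=" (after setLtEq(true)), items equal to q are also counted.
    if lt_eq:
        return bisect.bisect_right(sorted_vals, q)   # items <= q
    return bisect.bisect_left(sorted_vals, q)        # items < q

vals = [10, 20, 20, 30]
print(absolute_rank(vals, 20))              # 1  (items < 20)
print(absolute_rank(vals, 20, lt_eq=True))  # 3  (items <= 20)
```

The two variants differ only on ties, which is where the choice of criterion matters.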
What remains to be completed:
* The design of the binary storage format
* Serialization / Deserialization.
* Comprehensive characterization of accuracy, speed, and stored size as
functions of k and N.
* I could really use some help with code reviews and feedback! :) :)
If anyone needs some help in getting it booted up so you can play with it,
let me know. Sorry, this is not Python, it does take a few more steps :)
Also, if any of you are not signed up for our dev@ list, just send a
blank email to [email protected]. The traffic is really low,
so it won't swamp your inbox!
Cheers,
Lee.
On Mon, Sep 7, 2020 at 5:57 AM Pavel Vesely <[email protected]> wrote:
Hi Lee,
okay, let's talk on Thursday. I'll look at your code by then.
Cheers,
Pavel
On 07.09.2020 at 3:40 leerho wrote:
> Yes, let's meet on Thursday 9:00AM PDT (UTC -7).
>
> 1. I am uncomfortable with changing the getMaximumRSE(k) method by a factor
> of 8 (you suggested 1 or .9) without some rigorous testing or a proof. (We
> would lose our ability to say that our bounds are mathematically proven :) )
> We should discuss this.
>
> 2. Just as a comment, we always return ranks as normalized ranks [0,1], not
> as counts. Not a big deal, but when you look at my code you will need to
> know this :)
>
> Please take a look at my code at:
> https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles
>
> Cheers,
>
> Lee.
>
>
>
> On Thu, Sep 3, 2020 at 7:40 AM Vesely, Pavel <[email protected]> wrote:
>
> Hi Lee,
>
> I finally got back to you. Regarding your summary below, it's right that
> we don't have a mathematical analysis supporting Strategy 4, and there
> might be none (although constructing a specific example might be tricky).
>
> I've added the remaining functions that you need, namely:
>
> - getMaximumRSE(k): Returns an a priori estimate of the relative standard
> error (RSE, expressed as a number in [0,1]), calculated as sqrt(Var) /
> rank (note that neither n nor rank is needed). I'm using a formula for
> Var from the paper, but it seems too pessimistic (by a factor of 3),
> according to some experiments that I've done.
>
> - getUpperBound(numStdDev) and getLowerBound(numStdDev), which provide a
> confidence interval by adding/subtracting a specified multiple of stdDev
> from the estimate.
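A hypothetical sketch of bounds of this shape (the function names echo the description above, and the RSE value is invented; this is not Pavel's actual code):

```python
# Confidence bounds built from a relative standard error: the bound
# adds/subtracts numStdDev multiples of (RSE * estimate) from the estimate.
def upper_bound(estimate, rse, num_std_dev):
    return estimate * (1.0 + num_std_dev * rse)

def lower_bound(estimate, rse, num_std_dev):
    return estimate * (1.0 - num_std_dev * rse)

est, rse = 1000.0, 0.01       # made-up estimate and RSE
print(lower_bound(est, rse, 2), upper_bound(est, rse, 2))
```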
>
> Let me know if you have any questions. I can also join a call next week
> (Wednesday or Thursday would be best).
>
> Cheers,
> Pavel
>
>
> On 26.08.2020 at 21:18 leerho wrote:
>> Hi Pavel,
>> Thanks for your analysis. My summary of your analysis:
>>
>> * Strategy 1: This would work but performs too many compactions. It has
>> the simplest prediction of sketch size. The math obviously works, so
>> a priori prediction of sketch error should be relatively straightforward.
>> * Strategy 2: The math works and has far fewer compactions than
>> Strategy 1. You didn't comment on the difficulty of a priori
>> prediction of sketch size. Because it is anticipated in the paper,
>> the a priori prediction of sketch error should also be straightforward.
>> * Strategy 3: Comment: your restatement of Region 1 is equivalent to
>> what I called Region 1 and Region 2. I only separated out Region 2 to
>> emphasize that that is the size of the overflow. Major
>> disadvantages are: the math analysis doesn't support this strategy,
>> it would have as many compactions as Strategy 1, and it may be
>> vulnerable to pathological sequences. So Strategy 3 is definitely out!
>> * Strategy 4: (If I understood your comments...) the math does not
>> support this strategy. So let's leave this one out as well.
>>
>> That leaves only Strategy 2, which I will modify the Java code to
>> implement. This certainly simplifies testing, which I am all in favor of!
>>
>> Lee.
>>
>>
>> On Wed, Aug 26, 2020 at 9:22 AM Pavel Vesely <[email protected]> wrote:
>>
>> Hi Lee!
>>
>> Thanks for sending the links. I've done some renaming according to your
>> suggestions and added two new functions to my code:
>>
>> - quantile(q): given quantile query q in [0,1], it returns an approx.
>> q-quantile
>>
>> - getMaxStoredItems(k, n): this is a static method that estimates the
>> (max. nominal) size of the sketch with parameter k on input of size n. It
>> seems to be correct up to a 10% error (at least, when the input size and
>> k are sufficiently large, such as k >= 16). I believe a more precise
>> function would need to simulate constructing the whole sketch ...
>>
>> Regarding the strategies, I have a few comments (some of them I said
>> during the call):
>>
>> *Strategy 1:* Indeed, this strategy performs unnecessarily many compaction
>> operations. The only advantage is the easy predictability of the final
>> size of the sketch, i.e., possibly a more precise getMaxStoredItems
>> function.
>>
>> *Strategy 2:* Its advantage is that our mathematical analysis works for
>> sure (strictly speaking, the proof works for Strategy 1, but there are no
>> issues introduced by the laziness of Strategy 2, and actually, I think the
>> more general analysis of Appendix D applies to Strategy 2 directly).
>>
>> *Strategy 3:* I remember it slightly differently:
>>
>> - Region 1 has Length - numSections * sectionSize items and is never
>> compacted; it possibly contains some part of the "overflow" part (above
>> nomCap)
>> - Region 2 has the numSections * sectionSize largest items, split into
>> sections of size sectionSize, and we choose some number of sections to
>> compact
>>
>> This has the disadvantage of compacting just a few items when sectionSize
>> is small, leading to a larger number of compactions (similarly as for
>> Strategy 1). Furthermore, our mathematical analysis does not go through
>> directly, and there may be some (contrived) adversarial instances on which
>> it has a large error.
>>
>> The advantage is that we protect (i.e., do not compact) more items than in
>> the previous strategies when the buffer overflow is large.
>>
>> *Strategy 4:* I think Region 1 has size Length / 2, instead of nomCap / 2.
>> The advantage is that when the buffer overflow is large, we compact more
>> than in Strategy 3 (although not as many items as in Strategy 2), while we
>> also protect more items than in Strategy 2.
>>
>> The disadvantage is that the getMaxStoredItems function would be least
>> precise, because sectionSize varies over time (at least, that's my
>> intuition). As for Strategy 3, our mathematical analysis does not go
>> through directly, and there may be some adversarial instances on which it
>> has a large error, perhaps even more contrived than in the previous case.
>>
>> So far, I have no clear recommendation, and I haven't thought much about
>> calculating the confidence bounds (will get to it next week). I would only
>> exclude Strategy 1 and Strategy 3.
>>
>> Best,
>> Pavel
>>
>> On 25.08.2020 at 23:40 leerho wrote:
>> > I forgot to add the links:
>> >
>> > Main Code:
>> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/main/java/org/apache/datasketches/req
>> >
>> > Test Code:
>> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/test/java/org/apache/datasketches/req
>> >
>> > On Tue, Aug 25, 2020 at 2:10 PM leerho <[email protected]> wrote:
>> >
>> > Folks,
>> >
>> > Here are the links to the Java main code and the Java test code for the
>> > new REQ sketch. This code is in a very initial state and is in the branch
>> > "RelativeErrorQuantiles". It has a lot of work left before we merge it
>> > into master, so recognize it will be changing a lot.
>> >
>> > Pavel and I had a long discussion about what the best (& most practical)
>> > compaction strategy would be. Whatever strategy we choose, we need to be
>> > confident that we have mathematics that will support the implementation
>> > and that we will be able to obtain reasonable confidence bounds on the
>> > estimates produced by the sketch. All of these strategies assume that the
>> > low ranks are the most accurate. (This of course can be changed.)
>> >
>> > *Strategy 1.* My understanding of the paper is that it proposes this
>> > strategy:
>> > The compactor has a maximum size given by Capacity = 2 * numSections *
>> > sectionSize. As soon as the compactor reaches capacity, the compaction
>> > process starts.
>> >
>> > * Region1: The bottom half, of size Capacity / 2, which is never
>> > compacted.
>> > * Region2: The top half consists of m sections of size k (these can
>> > change as a f(numCompactions), but ignore that for this description).
>> > o The top-most of the m sections is always compacted, leaving
>> > m-1 sections to be potentially compacted.
>> > o Of the m-1 sections, choose the top m' of these based on the
>> > distribution created from the number of trailing ones in the
>> > number of compactions that have already occurred in this
>> > compactor.
>> >
>> > This will create the largest number of compactions and the most levels
>> > per value of N.
>> >
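A guess at what the trailing-ones rule above might look like in code (an illustration only, not the actual REQ implementation; the function name and the exact schedule are assumptions):

```python
def num_sections_to_compact(num_compactions, num_sections):
    # Count trailing ones in the binary representation of the number of
    # compactions already performed by this compactor.
    trailing_ones = 0
    c = num_compactions
    while c & 1:
        trailing_ones += 1
        c >>= 1
    # Always compact at least the top-most section; never more than all of
    # them. Deeper sections are thus compacted geometrically less often.
    return min(1 + trailing_ones, num_sections)

# The schedule over successive compactions, for m = 4 sections:
print([num_sections_to_compact(i, 4) for i in range(8)])
# -> [1, 2, 1, 3, 1, 2, 1, 4]
```

The geometric thinning is what lets low sections (low ranks, under this ordering) survive far longer than high ones.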
>> > *Strategy 2.* This strategy is what Pavel suggested in his Python code:
>> > The compactor has a Nominal Capacity and an actual Length that can exceed
>> > the nomCap. The nomCap is defined as above, where nomCap = 2 *
>> > numSections * sectionSize for each compactor. The sketch has a maxSize =
>> > sum(compactor[i].nomCap()). When this maxSize is exceeded, the
>> > compression process starts, which chooses the lowest numbered compactor
>> > that exceeds its nomCap to compact. If this turns out to be the top
>> > compactor, then a new compactor is added at the top.
>> >
>> > By allowing a compactor to exceed its nomCap, we keep more values at a
>> > lower level, improving accuracy. But the issue is what to do with the
>> > "overflow" items during compaction. This is addressed differently by this
>> > strategy and the next two.
>> >
>> > * This compactor has 3 regions:
>> > o Region1: of size nomCap / 2; is never compacted.
>> > o Region2: of size numSections * sectionSize.
>> > o Region3: of size Length - (Region1 + Region2); this is the
>> > "overflow" and is always compacted.
>> > * In this strategy, all of Region3 plus the randomly chosen top m'
>> > sections of Region2 will be compacted.
>> >
>> >
>> > *Strategy 3.* This strategy is what I have currently implemented in Java
>> > (and in my version of Pavel's Python) (because I didn't fully understand
>> > what Pavel had in mind :) ):
>> > This strategy is similar to Strategy 2, and also allows for overflow. The
>> > way the regions are treated is different:
>> >
>> > * This compactor also has 3 regions:
>> > o Region 1, of size nomCap / 2, is never compacted.
>> > o Region 2 is the overflow region of size Length - nomCap. It is
>> > also not compacted.
>> > o Region 3, of size numSections * sectionSize. The random m'
>> > sections will be compacted based on the f(numCompactions) as
>> > above.
>> >
>> > This keeps more values behind and would maintain better accuracy longer,
>> > but perhaps causes more compactions than #2.
>> >
>> > *Strategy 4.* This strategy is similar to #2 and #3, but again different
>> > in how the overflow values are treated:
>> >
>> > * This compactor has 2 regions:
>> > o Region 1, of size nomCap / 2, is never compacted.
>> > o Region 2, of size numSections * sectionSize', where
>> > sectionSize' is dynamically adjusted so that numSections *
>> > sectionSize' consumes most of Length - Region1. The number m'
>> > is chosen as before to determine which sections to compact.
>> >
>> > All of these strategies are implementable and have different tradeoffs
>> > with respect to accuracy, size, and speed.
>> >
>> > *Questions:*
>> >
>> > 1. Which of these strategies can provide:
>> >
>> > * An a priori calculation of the sketch size as a f(k, N).
>> > * An a priori calculation of estimation error with confidence
>> > intervals.
>> > * An a posteriori calculation of confidence intervals based on the
>> > state of the sketch.
>> > * A defense that claims that the sketch implementation is backed by
>> > mathematical proofs of correct operation with provable error
>> > properties.
>> >
>> > 2. What is your recommendation?
>> >
>> > Note: I may implement all 4 strategies so that we can fully characterize
>> > them, but I would prefer if we could limit these options to one (or two)
>> > to put into production. :)
>> >
>> > Your help with these questions would be appreciated.
>> >
>> > Lee.
>> >
>>