I am making good progress on the new Relative Error Quantiles <https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles> sketch. This is in the "RelativeErrorQuantiles" branch, not "master". It is not ready to be checked in yet, but it is getting close. It would be wonderful if someone could give it a whirl and/or take a look at the code. :)
Here is what is implemented so far:

- At sketch construction, it can be configured to prioritize accuracy for either the high ranks or the low ranks.
- The default item comparison criterion is "<", which is consistent with our other quantile sketches. However, the user can change the criterion to "<=" by simply calling *setLtEq(true)*. This allows comparison with other quantile functions that may use that criterion.
- I implemented extensive debug and "pretty printing" functions, which are not only useful for debugging but should help anyone visually see what is going on internally. (If anyone is interested in seeing this in action, let me know and I'll send out some instructions :) ). I developed a custom "signaling" interface to minimize clutter in the main code yet allow for deep analysis of what the code is doing.
- I custom crafted two versions of an *O(n)* merge sort, depending on the setting of highRankAccuracy (true/false). Merge sorting is used wherever possible.
- I also custom crafted a binary search algorithm to automatically accommodate the two comparison criteria.
- I think most of the javadocs are complete, but I'm sure there are always ways to improve the documentation.
- All the license headers are in place.
- Unit test coverage is hovering around 98% (according to my Clover).

What remains to be completed:

- The design of the binary storage format.
- Serialization / deserialization.
- Comprehensive characterization of accuracy, speed, and stored size as functions of *k* and *N*.
- I could really use some help with code reviews and feedback! :) :)

If anyone needs some help getting it booted up so you can play with it, let me know. Sorry, this is not Python; it does take a few more steps :)

Also, if any of you are not signed up for our *dev@* list, just send a blank email to [email protected]. The traffic is really low, so it won't swamp your inbox!

Cheers,

Lee.
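To illustrate what "accommodating the two comparison criteria" means, here is a minimal, self-contained Java sketch (not the actual REQ code; the class and method names are made up for illustration) of a binary search that counts items under either the "<" or the "<=" criterion:

```java
// Hypothetical illustration, not the REQ sketch's actual search code.
// Counts the items of a sorted array that satisfy the chosen criterion
// ("<" by default, "<=" when ltEq is true) against a query value.
public final class CriterionSearch {

  /**
   * Returns the number of items in sortedArr that are &lt; value (ltEq = false)
   * or &le; value (ltEq = true). Runs in O(log n).
   */
  public static int countBelow(final float[] sortedArr, final float value, final boolean ltEq) {
    int lo = 0;
    int hi = sortedArr.length; // find the first index that fails the criterion
    while (lo < hi) {
      final int mid = (lo + hi) >>> 1;
      final boolean satisfies = ltEq ? (sortedArr[mid] <= value) : (sortedArr[mid] < value);
      if (satisfies) { lo = mid + 1; } else { hi = mid; }
    }
    return lo;
  }

  public static void main(final String[] args) {
    final float[] arr = {1, 2, 2, 3, 5};
    System.out.println(countBelow(arr, 2f, false)); // "<"  criterion: prints 1
    System.out.println(countBelow(arr, 2f, true));  // "<=" criterion: prints 3
  }
}
```

The point of a single parameterized search is that the "<=" variant counts ties with the query value as below it, which is what some other quantile implementations assume.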
On Mon, Sep 7, 2020 at 5:57 AM Pavel Vesely <[email protected]> wrote:

> Hi Lee,
>
> okay, let's talk on Thursday. I'll look at your code by then.
>
> Cheers,
> Pavel
>
> On 07.09.2020 at 3:40 leerho wrote:
> > Yes, let's meet on Thursday 9:00AM PDT (UTC -7).
> >
> > 1. I am uncomfortable with changing the getMaximumRSE(k) method by a factor of 8 (you suggested 1 or .9) without some rigorous testing or a proof. (We would lose our ability to say that our bounds are mathematically proven :) ) We should discuss this.
> >
> > 2. Just as a comment, we always return ranks as normalized ranks [0,1], not as counts. Not a big deal, but when you look at my code you will need to know this :)
> >
> > Please take a look at my code at:
> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles
> >
> > Cheers,
> >
> > Lee.
> >
> > On Thu, Sep 3, 2020 at 7:40 AM Vesely, Pavel <[email protected]> wrote:
> >
> > Hi Lee,
> >
> > I finally got back to you. Regarding your summary below, it's right that we don't have a mathematical analysis supporting Strategy 4, and there might be none (although constructing a specific example might be tricky).
> >
> > I've added the remaining functions that you need, namely:
> >
> > - getMaximumRSE(k): Returns an a priori estimate of the relative standard error (RSE, expressed as a number in [0,1]), calculated as sqrt(Var) / rank (note that neither n nor rank is needed). I'm using a formula for Var from the paper, but it seems too pessimistic (by a factor of 3), according to some experiments that I've done.
> >
> > - getUpperBound(numStdDev) and getLowerBound(numStdDev): these provide a confidence interval by adding/subtracting a specified multiple of stdDev from the estimate.
> >
> > Let me know if you have any questions. I can also join a call next week (Wednesday or Thursday would be the best).
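A minimal Java sketch of how such relative-error bounds might look (the class name, method shapes, and the exact formula are assumptions for illustration only; the real functions are in Pavel's code). Since the RSE is relative to the rank, adding or subtracting numStdDev standard deviations amounts to scaling the estimate by (1 ± numStdDev * rse):

```java
// Hypothetical illustration of confidence bounds built from a relative
// standard error (RSE). Not the actual getUpperBound/getLowerBound code.
public final class RelativeBounds {

  /** Upper bound on a normalized rank estimate in [0,1]: estimate * (1 + numStdDev * rse), capped at 1. */
  public static double upperBound(final double estimate, final double rse, final int numStdDev) {
    return Math.min(1.0, estimate * (1.0 + (numStdDev * rse)));
  }

  /** Lower bound on a normalized rank estimate in [0,1]: estimate * (1 - numStdDev * rse), floored at 0. */
  public static double lowerBound(final double estimate, final double rse, final int numStdDev) {
    return Math.max(0.0, estimate * (1.0 - (numStdDev * rse)));
  }
}
```

For example, with an estimated rank of 0.5 and an RSE of 1%, the 2-standard-deviation interval would be roughly [0.49, 0.51].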
> >
> > Cheers,
> > Pavel
> >
> > On 26.08.2020 at 21:18 leerho wrote:
> >> Hi Pavel,
> >> Thanks for your analysis. My summary of your analysis:
> >>
> >> * Strategy 1: This would work but performs too many compactions. It has the simplest prediction of sketch size. The math obviously works, so a priori prediction of sketch error should be relatively straightforward.
> >> * Strategy 2: The math works, and it has far fewer compactions than Strategy 1. You didn't comment on the difficulty of a priori prediction of sketch size. Because it is anticipated in the paper, the a priori prediction of sketch error should also be straightforward.
> >> * Strategy 3: Comment: your restatement of Region 1 is equivalent to what I called Region 1 and Region 2. I only separated out Region 2 to emphasize that that is the size of the overflow. Major disadvantages are: the math analysis doesn't support this strategy, it would have as many compactions as Strategy 1, and it may be vulnerable to pathological sequences. So Strategy 3 is definitely out!
> >> * Strategy 4: (If I understood your comments...) the math does not support this strategy. So let's leave this one out as well.
> >>
> >> That leaves only Strategy 2, which I will modify the Java code to implement. This certainly simplifies testing, which I am all in favor of!
> >>
> >> Lee.
> >>
> >> On Wed, Aug 26, 2020 at 9:22 AM Pavel Vesely <[email protected]> wrote:
> >>
> >> Hi Lee!
> >>
> >> Thanks for sending the links. I've done some renaming according to your suggestions and added two new functions to my code:
> >>
> >> - quantile(q): given a quantile query q in [0,1], it returns an approx. q-quantile
> >>
> >> - getMaxStoredItems(k, n): this is a static method that estimates the (max. nominal) size of the sketch with parameter k on input of size n.
> >> It seems to be correct up to a 10% error (at least when the input size and k are sufficiently large, such as k >= 16). I believe a more precise function would need to simulate constructing the whole sketch ...
> >>
> >> Regarding the strategies, I have a few comments (some of them I said during the call):
> >>
> >> *Strategy 1:* Indeed, this strategy performs unnecessarily many compaction operations. The only advantage is easy predictability of the final size of the sketch, i.e., possibly a more precise getMaxStoredItems function.
> >>
> >> *Strategy 2:* Its advantage is that our mathematical analysis works for sure (strictly speaking, the proof works for Strategy 1, but there are no issues introduced by the laziness of Strategy 2, and actually, I think the more general analysis of Appendix D applies to Strategy 2 directly).
> >>
> >> *Strategy 3:* I remember it slightly differently:
> >>
> >> - Region 1 has Length - numSections * sectionSize items and is never compacted; it possibly contains some part of the "overflow" part (above nomCap)
> >> - Region 2 has the numSections * sectionSize largest items, split into sections of size sectionSize, and we choose some number of sections to compact
> >>
> >> This has the disadvantage of compacting just a few items when sectionSize is small, leading to a larger number of compactions (similarly as for Strategy 1). Furthermore, our mathematical analysis does not go through directly, and there may be some (contrived) adversarial instances on which it has a large error.
> >>
> >> The advantage is that we protect (i.e., not compact) more items than in the previous strategies when the buffer overflow is large.
> >>
> >> *Strategy 4:* I think Region 1 has size Length / 2, instead of nomCap / 2.
> >> The advantage is that when the buffer overflow is large, we compact more than in Strategy 3 (although not as many items as in Strategy 2), while we also protect more items than in Strategy 2.
> >>
> >> The disadvantage is that the getMaxStoredItems function would be least precise, because sectionSize varies over time (at least, that's my intuition). As for Strategy 3, our mathematical analysis does not go through directly, and there may be some adversarial instances on which it has a large error, perhaps even more contrived than in the previous case.
> >>
> >> So far, I have no clear recommendation, and haven't thought much about calculating the confidence bounds (will get to it next week). I would only exclude Strategy 1 and Strategy 3.
> >>
> >> Best,
> >> Pavel
> >>
> >> On 25.08.2020 at 23:40 leerho wrote:
> >> > I forgot to add the links:
> >> >
> >> > Main Code:
> >> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/main/java/org/apache/datasketches/req
> >> >
> >> > Test Code:
> >> > https://github.com/apache/incubator-datasketches-java/tree/RelativeErrorQuantiles/src/test/java/org/apache/datasketches/req
> >> >
> >> > On Tue, Aug 25, 2020 at 2:10 PM leerho <[email protected]> wrote:
> >> >
> >> > Folks,
> >> >
> >> > Here are the links to the Java main code and the Java test code for the new REQ sketch. This code is in a very initial state and is in the branch "RelativeErrorQuantiles". It has a lot of work left before we merge it into master, so recognize it will be changing a lot.
> >> >
> >> > Pavel and I had a long discussion about what the best (& most practical) compaction strategy would be.
> >> > Whatever strategy we choose, we need to be confident that we have mathematics that will support the implementation and that we will be able to obtain reasonable confidence bounds on the estimates produced by the sketch. All of these strategies assume that the low ranks are the most accurate. (This, of course, can be changed.)
> >> >
> >> > *Strategy 1.* My understanding of the paper is that it proposes this strategy:
> >> > The compactor has a maximum size given by Capacity = 2 * numSections * sectionSize. As soon as the compactor reaches capacity, the compaction process starts.
> >> >
> >> > * Region 1: The bottom half, of size Capacity / 2, which is never compacted.
> >> > * Region 2: The top half, consisting of /m/ sections of size /k/ (these can change as a f(numCompactions), but ignore that for this description)
> >> >   o The top-most of the m sections is always compacted, leaving m-1 sections to be potentially compacted.
> >> >   o Of the m-1 sections, choose the top m' of these based on the distribution created from the number of trailing ones in the number of compactions that have already occurred in this compactor.
> >> >
> >> > This will create the largest number of compactions and the most levels per value of N.
> >> >
> >> > *Strategy 2.* This strategy is what Pavel suggested in his Python code:
> >> > The compactor has a Nominal Capacity and an actual Length that can exceed the nomCap. The nomCap is defined as above, where nomCap = 2 * numSections * sectionSize for each compactor. The sketch has a maxSize = sum(compactor[i].nomCap()). When this maxSize is exceeded, the compression process starts, which chooses the lowest numbered compactor that exceeds its nomCap to compact.
> >> > If this turns out to be the top compactor, then a new compactor is added at the top.
> >> >
> >> > By allowing a compactor to exceed its nomCap, we keep more values at a lower level, improving accuracy. But the issue is what to do with the "overflow" items during compaction. This is addressed differently by this strategy and the next two.
> >> >
> >> > * This compactor has 3 regions:
> >> >   o Region 1: of size nomCap / 2; never compacted.
> >> >   o Region 2: of size numSections * sectionSize.
> >> >   o Region 3: of size Length - (Region1 + Region2); this is the "overflow" and is always compacted.
> >> > * In this strategy, all of Region 3 plus the randomly chosen top m' sections of Region 2 will be compacted.
> >> >
> >> > *Strategy 3.* This strategy is what I have currently implemented in Java (and my version of Pavel's Python) (because I didn't fully understand what Pavel had in mind :) ):
> >> > This strategy is similar to Strategy 2, and also allows for overflow. The way the regions are treated is different:
> >> >
> >> > * This compactor also has 3 regions:
> >> >   o Region 1 of size nomCap / 2 is never compacted.
> >> >   o Region 2 is the overflow region of size Length - nomCap. It is also not compacted.
> >> >   o Region 3 of size numSections * sectionSize. The random m' sections will be compacted based on the f(numCompactions) as above.
> >> >
> >> > This keeps more values behind and would maintain better accuracy longer, but perhaps causes more compactions than #2.
> >> >
> >> > *Strategy 4.* This strategy is similar to #2 and #3, but is again different in how the overflow values are treated:
> >> >
> >> > * This compactor has 2 regions:
> >> >   o Region 1 of size nomCap / 2 is never compacted.
> >> >   o Region 2 of size numSections * sectionSize', where sectionSize' is dynamically adjusted so that numSections * sectionSize' consumes most of Length - Region1. The number m' is chosen as before to determine which sections to compact.
> >> >
> >> > All of these strategies are implementable and have different tradeoffs with respect to accuracy, size, and speed.
> >> >
> >> > *Questions:*
> >> >
> >> > 1. Which of these strategies can provide:
> >> >
> >> > * An a priori calculation of the sketch size as a f(k, N).
> >> > * An a priori calculation of estimation error with confidence intervals.
> >> > * An a posteriori calculation of confidence intervals based on the state of the sketch.
> >> > * A defense that claims that the sketch implementation is backed by mathematical proofs of correct operation with provable error properties.
> >> >
> >> > 2. What is your recommendation?
> >> >
> >> > Note: I may implement all 4 strategies so that we can fully characterize them, but I would prefer if we could limit these options to one (or two) to put into production. :)
> >> >
> >> > Your help with these questions would be appreciated.
> >> >
> >> > Lee.
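One piece all of these strategies share is the "trailing ones" rule for deciding how many of the top sections to compact. Here is a minimal Java sketch of one plausible reading of that rule (the names are hypothetical and the exact schedule in the paper may differ; this only illustrates the trailing-ones computation itself):

```java
// Hypothetical illustration of the trailing-ones section-selection rule.
// Not the actual REQ compactor code; names and the exact schedule are
// assumptions for illustration.
public final class SectionChooser {

  /** Number of trailing 1-bits in numCompactions (e.g., 3 = binary ...011 has 2). */
  public static int trailingOnes(final long numCompactions) {
    // Complementing turns trailing ones into trailing zeros, which the JDK counts directly.
    return Long.numberOfTrailingZeros(~numCompactions);
  }

  /**
   * Sections to compact: always at least 1 (the top-most section), plus one
   * more per trailing one in the compaction count, capped at numSections.
   */
  public static int sectionsToCompact(final long numCompactions, final int numSections) {
    return Math.min(numSections, 1 + trailingOnes(numCompactions));
  }
}
```

Under this reading, deeper compactions (those whose counts end in long runs of ones) occur geometrically less often, which is what lets the higher sections be compacted more aggressively without losing the error guarantee.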
