leerho commented on issue #425:
URL: 
https://github.com/apache/datasketches-java/issues/425#issuecomment-1411001020

   Fotis, after some internal discussion, we think that part of the problem 
here may be a misunderstanding of how to apply the single-sided and 
double-sided error numbers that are shown in the tables in 
[KLLAccuracyAndSize](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).
  (We probably need to add some text describing this in more detail, which I 
will attempt to do here.)  To simplify our explanation, lets assume some of the 
key parameters of your experiment:
   -  KLL sketch, K=200
   - 27M input items (dates)
   - 1000 evenly distributed Split Points (SP). (day dates over a period of 
about 3 years)
   - single-sided absolute error of 1.33%
   - double-sided absolute error of 1.65%
   
   First of all, as stated on that page the error is _absolute (or additive) 
rank_ error as opposed to  _multiplicative (or relative)_ error.  This is 
important!
   
   We use normalized ranks, which means the representation of a rank is a 
fraction between 0 and 1.0 or a percent between 0% and 100%.  A rank of 0.5 
represents the median of all 27M items as if they were ordered.
   
   **Single-sided error query example:**
   The query r1 = getRank(q1), where q1 happened to be the median date, then r1 
= 0.5 +/- 0.0133.
   
   **Double-sided error query example:**
   Now add the query r2 = getRank(q1), where q2 happened to correspond to the 
90th percentile, then r2= 0.9 +/- 0.0133.
   
   But as soon as we subtract (r2-r1), the absolute error increases, because we 
are subtracting two random variables, to 0.0165.  The error calculation is now 
(r2-r1) => 0.4 +/- 0.0165.  
   
   When performing a single query the effective resolution is about 1/0.0133 or 
about 75 SP.
   When doing differences between two SP we have an effective resolution of 
1/0.0165 about 60 SP.
   
   By specifying 1000 SP, you may be implying to yourself that you want the 
sketch to resolve an error of .001, which is way lower than the noise threshold 
of the configured sketch at K=200.  There is no harm in specifying this large 
number of SP, but just don't kid yourself by querying two points that are too 
close together!   Asking the sketch to produce the PMF histogram of all 1000 SP 
will surely result in a very noisy histogram that may not be useful at all.
    
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to