leerho commented on issue #425: URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1411001020
Fotis, after some internal discussion, we think that part of the problem here may be a misunderstanding of how to apply the single-sided and double-sided error numbers that are shown in the tables in [KLLAccuracyAndSize](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html). (We probably need to add some text describing this in more detail, which I will attempt to do here.) To simplify our explanation, lets assume some of the key parameters of your experiment: - KLL sketch, K=200 - 27M input items (dates) - 1000 evenly distributed Split Points (SP). (day dates over a period of about 3 years) - single-sided absolute error of 1.33% - double-sided absolute error of 1.65% First of all, as stated on that page the error is _absolute (or additive) rank_ error as opposed to _multiplicative (or relative)_ error. This is important! We use normalized ranks, which means the representation of a rank is a fraction between 0 and 1.0 or a percent between 0% and 100%. A rank of 0.5 represents the median of all 27M items as if they were ordered. **Single-sided error query example:** The query r1 = getRank(q1), where q1 happened to be the median date, then r1 = 0.5 +/- 0.0133. **Double-sided error query example:** Now add the query r2 = getRank(q1), where q2 happened to correspond to the 90th percentile, then r2= 0.9 +/- 0.0133. But as soon as we subtract (r2-r1), the absolute error increases, because we are subtracting two random variables, to 0.0165. The error calculation is now (r2-r1) => 0.4 +/- 0.0165. When performing a single query the effective resolution is about 1/0.0133 or about 75 SP. When doing differences between two SP we have an effective resolution of 1/0.0165 about 60 SP. By specifying 1000 SP, you may be implying to yourself that you want the sketch to resolve an error of .001, which is way lower than the noise threshold of the configured sketch at K=200. There is no harm in specifying this large number of SP, but just don't kid yourself by querying two points that are too close together! Asking the sketch to produce the PMF histogram of all 1000 SP will surely result in a very noisy histogram that may not be useful at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
