psfotis opened a new issue, #425:
URL: https://github.com/apache/datasketches-java/issues/425

   Hello,
   
   I am working on a scenario where the items in an input stream are timestamps 
(day granularity). The stream contains 27M such items, but only ~1K distinct 
timestamps (spanning around three years). My goal is to estimate the number of 
items in a given time range. To do this, I am building a KLL sketch on the 
timestamps of the input stream (internally, these are converted to Unix epochs 
and passed to the sketch as floats).
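Since the epochs are passed in as floats, it may be worth checking that the conversion is lossless. The snippet below is a minimal sketch of such a check, assuming "floats" means 32-bit IEEE 754 values; the helper names `as_float32` and `survives_float32` are hypothetical and not part of any library. For values between 2^30 and 2^31 (which covers current Unix epochs), adjacent 32-bit floats are 128 apart, and 86400 = 675 * 128, so timestamps aligned to midnight UTC should round-trip exactly while other timestamps may be rounded.

```python
import struct

SECONDS_PER_DAY = 86400


def as_float32(x):
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def survives_float32(epoch_seconds):
    """True if the epoch value is exactly representable as a 32-bit float."""
    return as_float32(epoch_seconds) == float(epoch_seconds)


# Hypothetical example values: a midnight-UTC epoch and the same day plus one hour.
midnight = (1_500_000_000 // SECONDS_PER_DAY) * SECONDS_PER_DAY
one_hour_later = midnight + 3600

print(survives_float32(midnight))        # day-aligned value
print(survives_float32(one_hour_later))  # value that is not a multiple of 128
```

If the timestamps are not aligned to midnight UTC (for example, local-time midnights), feeding the sketch a day index instead of raw seconds would be one way to avoid rounding.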
   
   To get the estimate, I request the PMF from the KLL sketch, passing it the 
following split points: starting from the min timestamp, each split point is 
86400 (seconds in a day) greater than the previous one. Using the PMF, I can 
calculate the number of items in each bin (corresponding to a day) defined by 
the split points. From this histogram, I can then calculate the number of items 
in a given time range.
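The procedure above can be sketched without the library as follows. This is only an illustration: `pmf` stands in for the array returned by the sketch, `n` for the stream length, and the function names are hypothetical. It assumes that m split points yield m + 1 masses, with the first mass covering everything below the first split point, and that each bin includes its lower split point; the inclusive/exclusive convention may differ between library versions.

```python
SECONDS_PER_DAY = 86400


def daily_split_points(min_ts, num_days):
    """Split points min_ts, min_ts + 86400, ...; one split point per day start."""
    return [min_ts + SECONDS_PER_DAY * i for i in range(num_days)]


def range_count(pmf, n, first_day, last_day):
    """Estimated number of items in days first_day..last_day (inclusive, 0-based).

    pmf[0] is the mass below min_ts (expected to be ~0), so day d maps to
    pmf[d + 1]. Masses are fractions of the stream; multiply by n for counts.
    """
    mass = sum(pmf[first_day + 1:last_day + 2])
    return mass * n


# Hypothetical PMF for a 4-day stream of 1000 items.
splits = daily_split_points(1_499_990_400, 4)
pmf = [0.0, 0.1, 0.2, 0.3, 0.4]
estimate = range_count(pmf, 1000, 1, 2)  # days 1 and 2
print(splits, estimate)
```

Summing adjacent bins is equivalent to taking the difference of two rank (CDF) estimates at the ends of the range, which matters when reasoning about the error of the result.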
   
   My issue with the KLL sketches is that the actual error can be quite large 
in my experiments: for K=200, the actual error on a 30-day time range has 
fluctuated between 84.5% and 94%. Increasing to K=1000 improves this 
performance substantially (96%-99.8% actual error). (By actual error, I mean 
the result compared to the real number of items, computed without 
approximation.)
   
   I wanted to ask whether there is a way to obtain an error estimate from KLL 
sketches for a query like the above (i.e., the number of items in a range), and 
whether such errors are expected.
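For context, KLL error guarantees are stated in terms of normalized rank, i.e. as a fraction of the total stream length, not as a fraction of the queried count. A rough sketch of what that implies for a range query is given below. It assumes a normalized rank error `eps` for PMF-style queries (in the Java library this would presumably come from `getNormalizedRankError(true)`); the value 0.0165 used for K=200 is an approximate figure, and the 30-day true count of 810,000 is a hypothetical number obtained by assuming items are spread evenly over ~1K days. The function names are illustrative only.

```python
def range_count_bounds(estimate, n, eps):
    """Approximate bounds on a range count given normalized rank error eps.

    The absolute margin is eps * n items, independent of the range size.
    """
    margin = eps * n
    lower = max(0.0, estimate - margin)
    upper = min(float(n), estimate + margin)
    return lower, upper


def worst_case_relative_error(true_count, n, eps):
    """Margin expressed relative to the true count in the range."""
    return eps * n / true_count


N = 27_000_000            # stream length from the scenario above
EPS_K200 = 0.0165         # assumed approximate PMF rank error for K=200
TRUE_30_DAYS = 810_000    # hypothetical: N * 30 / 1000, assuming a uniform spread

lower, upper = range_count_bounds(TRUE_30_DAYS, N, EPS_K200)
rel = worst_case_relative_error(TRUE_30_DAYS, N, EPS_K200)
print(lower, upper, rel)
```

Under these assumptions the margin is a fixed number of items, so narrow ranges that hold a small share of the stream can show a large relative error even when the sketch is within its stated rank error.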
   
   Happy to provide any more details if necessary. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

