psfotis opened a new issue, #425: URL: https://github.com/apache/datasketches-java/issues/425
Hello,

I am working on a scenario where the items in an input stream are timestamps at day granularity. The stream contains 27M items, but only ~1K distinct timestamps (spanning around three years). My goal is to estimate the number of items in a given time range.

To do this, I build a KLL sketch on the timestamps of the input stream (internally, they are converted to Unix epoch seconds and passed to the sketch as floats). To get the estimate, I query the sketch for the PMF using the following split points: starting from the minimum timestamp, each subsequent split point is 86400 (the number of seconds in a day) greater than the previous one. From the PMF, I calculate the number of items in each bin (one bin per day), and from this histogram I calculate the number of items in a given time range.

My issue with KLL sketches is that the actual error can be quite large: with K=200, the actual error on a 30-day time range fluctuated between 84.5% and 94% in my experiments. Increasing to K=1000 improves this substantially (96%-99.8% actual error). (By actual error, I mean the result of comparing the estimate against the real number of items, computed without approximation.)

I wanted to ask whether there is a way to get an error estimate from a KLL sketch for a query like the above (i.e., the number of items in a range), and whether such errors are expected.

Happy to provide more details if necessary.
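For reference, the range-count calculation described above can be sketched roughly as follows. This is a minimal illustration, not the exact code from my experiments: the class and method names are hypothetical, the PMF values and the exact count are made up, and bin `i` is assumed to correspond to day `i`. In practice the `pmf` array would come from the sketch's `getPMF` call with the daily split points, and `n` from the sketch's item count.

```java
final class RangeCountFromPmf {

    /**
     * Estimated number of items in days firstDay..lastDay (inclusive).
     * pmf[i] is the fraction of the stream assumed to fall in day i;
     * n is the total number of items in the stream.
     */
    public static double estimateRangeCount(double[] pmf, long n, int firstDay, int lastDay) {
        if (firstDay < 0 || lastDay >= pmf.length || firstDay > lastDay) {
            throw new IllegalArgumentException("invalid day range");
        }
        double mass = 0.0;
        for (int d = firstDay; d <= lastDay; d++) {
            mass += pmf[d]; // accumulate probability mass over the range
        }
        return mass * n; // scale the summed mass to an item count
    }

    /** Relative error of an estimate against the exact count. */
    public static double relativeError(double estimate, long exact) {
        return Math.abs(estimate - exact) / exact;
    }

    public static void main(String[] args) {
        // Made-up PMF over four daily bins (illustrative values only).
        double[] pmf = {0.25, 0.25, 0.125, 0.375};
        long n = 1000L;

        double estimate = estimateRangeCount(pmf, n, 1, 2);
        long exact = 400L; // made-up exact count for the same range

        System.out.println("estimate = " + estimate);
        System.out.println("relative error = " + relativeError(estimate, exact));
    }
}
```

The bins are summed first and multiplied by `n` once, so the range estimate depends only on the total PMF mass between the two ends of the range.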
