c-dickens commented on PR #71: URL: https://github.com/apache/datasketches-rust/pull/71#issuecomment-3754086591
To answer the question directly: this example of scaling up or down by a **nonnegative** value is acceptable. However, there is a knock-on effect on the error bound which needs testing. Basically, for an underlying frequency value `f_i` of item `i` the error bound is `true(f_i) <= est(f_i) <= true(f_i) + error` where error depends on the sketch size and the total mass in the stream. So if you are scale the inputs to the sketch (that make up each "true" `f_i`), then you also need to scale the `total_weight` - this is implemented in the PR but the altered error bounds are not tested. In future, we do need a little care because there is some subtlety around the behaviour depending on the input stream. If only **positive** integers are used for addition and multiplication **and** the underlying stream only has positive frequencies, then things are ok. The nuance comes in around negative values in the underlying frequency vector, in which case we might prefer a `CountSketch` -- something useful that we haven't got round to yet! There are some good notes [here](https://fuhuthu.com/BigData2022/countmin.pdf) and [here](https://www.cs.rice.edu/~as143/COMP480_580_Spring20/scribe/lect10.pdf) that make the point clear. The "standard" count min point query estimation is only valid on the assumption that all entries in the underlying frequency vector are nonnegative. Happy to give more details if need be. In reference to @tisonkun's comment, there are a bunch of windowing/merging/aggregate approaches that might be interested. If so please add to the [discussion](https://github.com/apache/datasketches-rust/discussions/64) - I am very keen to learn if you have specific applications in mind, or whether they (and other sketches eg. CountSketch) would be generally useful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
