c-dickens commented on PR #71:
URL: https://github.com/apache/datasketches-rust/pull/71#issuecomment-3754086591

   To answer the question directly: this example of scaling up or down by a 
**nonnegative** value is acceptable.  However, there is a knock-on effect on 
the error bound which needs testing.  Basically, for an underlying frequency 
value `f_i` of item `i` the error bound is `true(f_i) <= est(f_i) <= true(f_i) 
+ error` where error depends on the sketch size and the total mass in the 
stream.  So if you are scale the inputs to the sketch (that make up each "true" 
`f_i`), then you also need to scale the `total_weight` - this is implemented in 
the PR but the altered error bounds are not tested. 
   
   In future, we do need a little care because there is some subtlety around 
the behaviour depending on the input stream.  If only **positive** integers are 
used for addition and multiplication **and** the underlying stream only has 
positive frequencies, then things are ok.  The nuance comes in around negative 
values in the underlying frequency vector, in which case we might prefer a 
`CountSketch` -- something useful that we haven't got round to yet!
   
   There are some good notes 
[here](https://fuhuthu.com/BigData2022/countmin.pdf) and 
[here](https://www.cs.rice.edu/~as143/COMP480_580_Spring20/scribe/lect10.pdf) 
that make the point clear.  The "standard" count min point query estimation is 
only valid on the assumption that all entries in the underlying frequency 
vector are nonnegative.  Happy to give more details if need be.
   
   In reference to @tisonkun's comment, there are a bunch of 
windowing/merging/aggregate approaches that might be interested.  If so please 
add to the 
[discussion](https://github.com/apache/datasketches-rust/discussions/64) - I am 
very keen to learn if you have specific applications in mind, or whether they 
(and other sketches eg. CountSketch) would be generally useful.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to