will-lauer commented on issue #144:
URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2718720509

   One way to think about it is that information can be lost at multiple 
points: when building the original sketches and at each level of merging them. 
By using a higher lg_k during the merge, you haven't avoided losing information 
at the first stage (creating the sketches), but you can avoid losing _more_ 
information at the later stages. This results in better error rates in the 
merged result, even though some information has already been lost.
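   To make that concrete, here is a minimal sketch of the idea using the Java 
datasketches-java API (assuming a recent release where `Union` exposes 
`union(Sketch)`; the lg_k values, input counts, and class name are arbitrary 
illustration, not anything from this repo):

```java
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class MergeAtHigherLgK {
  public static void main(String[] args) {
    final int lgKBuild = 12;  // lg_k used for the original sketches
    final int lgKMerge = 14;  // larger lg_k used only when merging

    // Stage one: build input sketches at the smaller lg_k. Each goes into
    // estimation mode and discards some information (theta drops below 1).
    Sketch[] inputs = new Sketch[8];
    long key = 0;
    for (int i = 0; i < inputs.length; i++) {
      UpdateSketch s = UpdateSketch.builder().setLogNominalEntries(lgKBuild).build();
      for (int n = 0; n < 100_000; n++) {
        s.update(key++);  // all keys distinct, so every input samples heavily
      }
      inputs[i] = s.compact();
    }

    // Stage two: merge the same inputs at two different lg_k values.
    Union atBuildK = SetOperation.builder().setLogNominalEntries(lgKBuild).buildUnion();
    Union atMergeK = SetOperation.builder().setLogNominalEntries(lgKMerge).buildUnion();
    for (Sketch s : inputs) {
      atBuildK.union(s);
      atMergeK.union(s);
    }

    // The larger-k union retains more of the entries that survived stage one,
    // so it typically reports a tighter estimate.
    System.out.println("retained at lg_k=12: " + atBuildK.getResult().getRetainedEntries());
    System.out.println("retained at lg_k=14: " + atMergeK.getResult().getRetainedEntries());
  }
}
```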
   
   The theta sketch accounts for this in its calculation of relative error. If 
there weren't enough items to trigger additional sampling when doing your 
merge, you may end up with a merged sketch that isn't completely filled, and 
the larger lg_k doesn't help you there. But if you have enough unique values 
that additional sampling does happen as you merge, you end up with a sketch 
containing more values, all of them values the initial sketches had already 
kept regardless of which lg_k was chosen in the first stage. This means you are 
actually better off: your relative error has improved even though lg_k was 
lower when building the initial sketches. The values the initial sketches 
discarded would have been lost regardless; the larger lg_k at merge time simply 
avoids compounding that loss.
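   As a back-of-envelope illustration of why retaining more entries improves 
the reported error (this is the rough scaling, not the library's exact 
estimator): the relative standard error of an estimation-mode theta sketch 
behaves roughly like 1/sqrt(retained entries), so quadrupling the retained 
count at merge time roughly halves the error.

```java
// Rough scaling only: the library's bounds use a more refined estimator.
public class ThetaErrorScaling {
  static double roughRse(long retained) {
    return 1.0 / Math.sqrt((double) retained);
  }

  public static void main(String[] args) {
    // Retained counts corresponding to full sketches at lg_k = 12 and 14.
    System.out.printf("retained = %d: RSE ~ %.2f%%%n", 1L << 12, 100 * roughRse(1L << 12));
    System.out.printf("retained = %d: RSE ~ %.2f%%%n", 1L << 14, 100 * roughRse(1L << 14));
  }
}
```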

