will-lauer commented on issue #144: URL: https://github.com/apache/datasketches-bigquery/issues/144#issuecomment-2718720509
One way to think about it is that information can be lost at multiple stages: when building the original sketches, and at each level of merging. Using a higher lg_k during the merge doesn't avoid the loss at the first stage (creating the sketches), but it can avoid losing _more_ information at the later stages. This actually results in better error in the merged result, even though some information has already been lost, and the theta sketch accounts for this in its calculation of relative error.

If the merge doesn't accumulate enough items to trigger additional sampling, you may end up with a merged sketch that isn't completely filled, and the larger lg_k doesn't help you there. But if you have enough unique values that additional sampling does happen as you merge, you end up with a sketch that retains more values, and specifically values that the initial sketches would have kept regardless of which lg_k was chosen in the first stage. This means you are actually better off: your relative error has improved even though lg_k was lower when building the initial sketches. The values the initial sketches discarded are ones the merge would have discarded anyway, even if the larger lg_k had been used from the start.
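To make the effect concrete, here is a minimal, hypothetical Java sketch using the core `datasketches-java` library (recent versions) rather than the BigQuery UDFs in this repo, which expose the same semantics. The lg_k values, partition count, and item counts below are illustrative assumptions, not anything from this issue:

```java
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class MergeLgKDemo {
  public static void main(String[] args) {
    final int buildLgK = 10; // lg_k used when building the per-partition sketches
    final int mergeLgK = 14; // larger lg_k used only for the union

    // Union configured with a larger lg_k than any of its input sketches.
    final Union union = SetOperation.builder()
        .setLogNominalEntries(mergeLgK)
        .buildUnion();

    long key = 0;
    for (int part = 0; part < 32; part++) {
      // Each partition builds its own small sketch at the smaller lg_k.
      final UpdateSketch sk = UpdateSketch.builder()
          .setLogNominalEntries(buildLgK)
          .build();
      for (int i = 0; i < 100_000; i++) {
        sk.update(key++); // keys are unique across all partitions
      }
      union.union(sk);
    }

    final CompactSketch merged = union.getResult();
    System.out.printf("estimate: %,.0f (true: %,d)%n", merged.getEstimate(), key);
    System.out.printf("95%% bounds: [%,.0f, %,.0f]%n",
        merged.getLowerBound(2), merged.getUpperBound(2));
    System.out.println("retained entries: " + merged.getRetainedEntries());
  }
}
```

Because the union was built with lg_k = 14, it can retain up to 2^14 entries below the combined theta, so the merged estimate's relative standard error (roughly 1/sqrt(retained entries)) comes out smaller than it would if the union had also used lg_k = 10, even though every input sketch was built at the smaller size.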