thomasrebele commented on issue #693: URL: https://github.com/apache/datasketches-java/issues/693#issuecomment-3597778915
Thank you @AlexanderSaydakov and @leerho for the feedback. I understand that KLL is probabilistic in nature, and probabilistic updates (feeding the data to a single KLL sketch) are not my issue; I'm happy with the non-determinism there. However, I still need a deterministic result for my use case: merging `n` KLL sketches into a single one.

I agree that misusing the random number generator (RNG) can lead to bad sketches: if the same seed is used for every `KllSketch#merge(KllSketch, Random)` call, the errors add up and become quite large. There is a way around this, though: when the RNG is initialized with a fixed seed once at the beginning and re-used across the merge operations, the error appears to be the same as with the original method that uses `KllSketch#random`.

I've prepared some experiments based on the proposed PR. To me, the deterministic merge still looks good enough for my use case, as its errors are very similar to those of the original merge method. Please see https://github.com/thomasrebele/datasketches-java/commit/8abbddfea4e2dddc6849e6fe7c44dcd83dae17a5 for the results and the code. I'm not sure whether I measured the normalized rank error correctly, as I could not find its exact definition; I'm happy to adapt the code if you point me to it.

Would you have a look at my experiment, please? I'm happy to extend it if you think more experiments are necessary.
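
To illustrate the usage pattern I have in mind, here is a minimal sketch. The two-argument `merge(KllSketch, Random)` overload is the one proposed in the PR and is not part of the released datasketches-java API; the sketch parameter `k`, the seed, and the test data are only illustrative assumptions.

```java
import java.util.Random;
import org.apache.datasketches.kll.KllDoublesSketch;

public class DeterministicMergeExample {
  public static void main(String[] args) {
    // Build a few partial sketches, e.g. one per data partition.
    KllDoublesSketch[] parts = new KllDoublesSketch[4];
    for (int p = 0; p < parts.length; p++) {
      parts[p] = KllDoublesSketch.newHeapInstance(200); // k = 200 chosen arbitrarily
      for (int i = 0; i < 100_000; i++) {
        parts[p].update(p * 100_000 + i);
      }
    }

    // One RNG, seeded once, shared by ALL merge calls.
    // Re-seeding it for every merge would correlate the compaction
    // decisions across merges and inflate the error.
    Random rng = new Random(42L);

    KllDoublesSketch merged = KllDoublesSketch.newHeapInstance(200);
    for (KllDoublesSketch part : parts) {
      // merge(KllSketch, Random) is the overload proposed in the PR (hypothetical here).
      merged.merge(part, rng);
    }

    System.out.println("median estimate: " + merged.getQuantile(0.5));
  }
}
```

The key point is that the `Random` instance is created once and shared by all merge calls, so each merge still makes independent compaction decisions, while the final result is reproducible for a given seed and merge order.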
