thomasrebele commented on issue #693: URL: https://github.com/apache/datasketches-java/issues/693#issuecomment-3671099867
Thank you for your suggestion. Hive merges the KLL sketches in a regular Java function, so the order of the sketches can be enforced. I expect the number of sketches to be merged to be fewer than one million.

The Hive project also uses the KLL sketch results indirectly: they appear in the output of the various `EXPLAIN` commands in the q.out files, which are compared with the expected versions. If the results are not stable, it is possible to mask them. However, if the purpose of a test is to check whether the statistics have been calculated correctly, masking them does not help.

Switching to a comparison that tolerates a certain uncertainty raises many other questions: How should the uncertainty be evaluated? How should the threshold at which the comparison fails be defined? How can flaky tests be avoided? Hive's tests for a PR take several hours on a cluster, and it is quite annoying when a test fails for a reason unrelated to the PR.

I would even go as far as to say that it would be nice to have an (additional!) deterministic update method to facilitate testing in Hive. Please be reassured: I want to KEEP the existing behavior. I just want to add another method that allows 3rd-party libraries to override the RNG when needed. The javadoc of the new methods should clearly state that the caller is responsible for evaluating whether it is safe to provide their own RNG.
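To make the proposal concrete, here is a minimal, self-contained sketch of the pattern: an additional overload that accepts a caller-supplied RNG, with the existing unseeded behavior left untouched. `ToyCompactor` and both `compact` methods are hypothetical names invented for this illustration; they are not part of the datasketches-java API, and the halving step is only loosely modeled on the kind of random coin flip a sketch compaction might make.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical toy class for illustration only; NOT the datasketches-java API. */
class ToyCompactor {

    /** Existing behavior (kept as-is): a fresh, unseeded RNG on every call. */
    static List<Integer> compact(List<Integer> items) {
        return compact(items, new Random());
    }

    /**
     * Additional overload: the caller supplies the RNG and is responsible for
     * deciding whether that is safe for their use case.
     */
    static List<Integer> compact(List<Integer> items, Random rng) {
        // Work on a sorted copy so the input list is left unmodified.
        List<Integer> sorted = new ArrayList<>(items);
        Collections.sort(sorted);

        // Random coin flip: start at position 0 or position 1.
        int offset = rng.nextBoolean() ? 1 : 0;

        // Keep every second item starting at the chosen offset.
        List<Integer> kept = new ArrayList<>();
        for (int i = offset; i < sorted.size(); i += 2) {
            kept.add(sorted.get(i));
        }
        return kept;
    }
}
```

With a fixed seed and a fixed input order, two calls such as `ToyCompactor.compact(values, new Random(42))` return the same list, which is the property a q.out comparison would rely on. The one-argument overload keeps its nondeterministic behavior.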
