RAMitchell commented on issue #361: URL: https://github.com/apache/datasketches-cpp/issues/361#issuecomment-1518699030
> I am afraid you did not quite explain why is it necessary to repeat exactly. What happens if it is a bit different?

Here is an article that discusses reproducibility issues in ML: https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/

Also, I think many users find it difficult to evaluate the relative impact of changes to training hyper-parameters on eval-set performance when changes are also occurring due to randomness. Having a seed allows them to deal with these two sources of change separately. You could argue that ML folks are doing things wrong, but it does seem to be a common requirement we get. So much so that we rebuilt the XGBoost library's GPU algorithms to be deterministic even with floating-point atomics.

I am prepared to carefully seed each sketch instance differently. I agree this is a potential pitfall, but I think HPC developers are used to dealing with this issue for things like random number generation or Monte Carlo simulations. It's relatively easy to ensure the data is fed to the sketches in the same order. The sketches would be merged in a tree-reduce pattern, which is also deterministic. We would be running this code on the Legate framework, so as long as it partitions the data to workers in the same way on repeated runs, these conditions will be met. We will never be able to guarantee determinism across different partitionings, but that is fine.
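To make the scheme concrete, here is a minimal Python sketch of the pattern I have in mind. The `ToySketch` class is a hypothetical stand-in (a seeded reservoir sample) for a real DataSketches sketch, not the library's API; the point is the structure: one differently-seeded sketch per partition, a fixed feed order within each partition, and a fixed-shape pairwise tree reduction, so repeated runs over the same partitioning give identical results.

```python
import random

class ToySketch:
    """Hypothetical stand-in for a real sketch: a seeded reservoir sample of size k."""
    def __init__(self, seed, k=4):
        self.rng = random.Random(seed)  # per-instance seed, distinct per partition
        self.k = k
        self.items = []
        self.n = 0

    def update(self, x):
        # Standard reservoir sampling: deterministic given seed and feed order.
        self.n += 1
        if len(self.items) < self.k:
            self.items.append(x)
        else:
            j = self.rng.randrange(self.n)
            if j < self.k:
                self.items[j] = x

    def merge(self, other):
        # Deterministic given the two inputs and this sketch's RNG state.
        for x in other.items:
            self.update(x)
        return self

def tree_reduce(sketches):
    """Pairwise merge in a fixed tree shape, so the merge order never varies."""
    while len(sketches) > 1:
        nxt = []
        for i in range(0, len(sketches) - 1, 2):
            nxt.append(sketches[i].merge(sketches[i + 1]))
        if len(sketches) % 2:
            nxt.append(sketches[-1])
        sketches = nxt
    return sketches[0]

def run(partitions, base_seed=42):
    # One sketch per partition, each seeded differently but reproducibly.
    sketches = [ToySketch(seed=base_seed + i) for i in range(len(partitions))]
    for sk, part in zip(sketches, partitions):
        for x in part:  # fixed feed order within each partition
            sk.update(x)
    return tree_reduce(sketches)

parts = [list(range(i * 100, (i + 1) * 100)) for i in range(8)]
result = run(parts)  # identical on every repeated run over the same partitioning
```

A different partitioning of the same data would change the per-partition seeds and feed order, so, as noted above, determinism only holds for repeated runs with the same partitioning.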
