RAMitchell commented on issue #361:
URL: https://github.com/apache/datasketches-cpp/issues/361#issuecomment-1518699030

   > I am afraid you did not quite explain why is it necessary to repeat 
exactly. What happens if it is a bit different?
   
   Here is an article that discusses reproducibility issues in ML: 
https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/
   
   Also, I think many users find it difficult to evaluate the relative impact 
of changes to training hyper-parameters on eval-set performance when changes 
are also occurring due to randomness. Fixing a seed lets them separate the two 
effects.
   
   I think you could argue the ML folks are doing things wrong, but it does 
seem to be a common requirement we get. So much so that we rebuilt the XGBoost 
library's GPU algorithms to be deterministic even with floating-point atomics.
   
   I am prepared to carefully seed each sketch instance differently. I agree 
this is a potential pitfall, but I think HPC developers are used to dealing 
with this issue for things like random number generation or Monte Carlo 
simulations.
   
   It's relatively easy to ensure the data is fed to the sketches in the same 
order. The sketches would be merged in a tree-reduce pattern, which is also 
deterministic. We would be running this code on the Legate framework, so as 
long as it partitions the data to the workers in the same way on repeated 
runs, these conditions will be met.
   
   We will never be able to guarantee determinism on different partitionings 
but that is fine.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

