ptrendx commented on pull request #19209:
URL: https://github.com/apache/incubator-mxnet/pull/19209#issuecomment-698635453

Ok, I now understand the reproducibility problem I saw: `cudnnSetDropoutDescriptor` is asynchronous and the CUDA stream was not properly synchronized, so if `cudnnSetDropoutDescriptor` was picked up by one thread and the dropout by another, there was a race condition on the CUDA side. I fixed that in the latest commit.

I still believe there is a potential problem with resource assignment, although it would not typically be hit, because ops are usually launched from a single thread. The thread-safe cachedop would be the main way to trigger it, but that would also require somebody to seed frequently during execution, so it is not a very common scenario either.
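
To illustrate the kind of race described above, here is a minimal sketch in standard C++. It is a toy model only: it makes no cuDNN or CUDA calls and is not the code from this PR. `ToyStream` and `SeedThenDropout` are hypothetical names, a worker thread stands in for a CUDA stream, and the value 42 and the 20 ms delay are arbitrary. The asynchronous "seed" task plays the role of `cudnnSetDropoutDescriptor`, and the task on the second stream plays the role of the dropout.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Toy stand-in for a CUDA stream: tasks run asynchronously, in FIFO order,
// on a dedicated worker thread. Hypothetical; not part of CUDA or MXNet.
class ToyStream {
 public:
  ToyStream() : worker_([this] { Run(); }) {}
  ~ToyStream() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }
  // Returns immediately, like an asynchronous launch.
  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
      ++pending_;
    }
    cv_.notify_all();
  }
  // Blocks until all enqueued work has finished.
  void Synchronize() {
    std::unique_lock<std::mutex> lock(mu_);
    idle_cv_.wait(lock, [this] { return pending_ == 0; });
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stop requested and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
      {
        std::lock_guard<std::mutex> lock(mu_);
        --pending_;
      }
      idle_cv_.notify_all();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_, idle_cv_;
  std::queue<std::function<void()>> tasks_;
  int pending_ = 0;
  bool stop_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Seeds the RNG state asynchronously on one stream, then reads it from a
// task on another stream. Returns the state that the "dropout" observed.
std::uint64_t SeedThenDropout(bool synchronize) {
  std::atomic<std::uint64_t> rng_state{0};  // 0 means "not yet initialized"
  std::atomic<std::uint64_t> observed{0};
  ToyStream seed_stream, dropout_stream;

  seed_stream.Enqueue([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));  // slow init
    rng_state.store(42);
  });
  // Waiting here guarantees the state is ready before anyone else uses it.
  if (synchronize) seed_stream.Synchronize();

  dropout_stream.Enqueue([&] { observed.store(rng_state.load()); });
  dropout_stream.Synchronize();
  seed_stream.Synchronize();
  return observed.load();
}
```

In this model, `SeedThenDropout(true)` always returns 42, while `SeedThenDropout(false)` may return either 0 or 42 depending on timing, which is the non-reproducibility in miniature.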
