al-rigazzi commented on issue #10865: A potential race condition in the 
executor or engine.
URL: 
https://github.com/apache/incubator-mxnet/issues/10865#issuecomment-424668146
 
 
   @zheng-da With very large batches, I sometimes observe NaN values at the 
first training iteration. This happens much more often when I use more OMP 
threads and the network is large. For example, with more than 20 OMP threads, 
VGG-16, and 1024 samples per batch (on a single node), I get NaNs 10% of the 
time.
   
   I think this could be due to a race condition when allocating or copying 
MKLDNN memory. Does that make sense to you? Do you know which functions I 
should try to monitor to find the root of the problem?
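   
   To compare configurations, the NaN rate could be measured over repeated 
runs with something like the sketch below. It is only an illustration: 
`run_first_iteration` is a hypothetical callable (not an MXNet API) that would 
run the first training iteration and return the output values as floats, and 
the thread count is assumed to be set through `OMP_NUM_THREADS` in the 
environment before the process starts.

```python
import math
from typing import Callable, Iterable


def has_nan(values: Iterable[float]) -> bool:
    """Return True if any value in the iterable is NaN."""
    return any(math.isnan(v) for v in values)


def nan_rate(run_first_iteration: Callable[[], Iterable[float]],
             trials: int) -> float:
    """Call the (hypothetical) first-iteration runner `trials` times and
    return the fraction of runs whose output contained a NaN."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    failures = sum(1 for _ in range(trials) if has_nan(run_first_iteration()))
    return failures / trials


if __name__ == "__main__":
    # Stand-in for a real training step; replace with code that runs one
    # iteration of the network and returns its outputs as Python floats.
    def fake_iteration() -> list:
        return [0.1, 0.2, 0.3]

    print("NaN rate:", nan_rate(fake_iteration, trials=10))
```

   Running this once per thread count would show whether the failure rate 
really grows with the number of OMP threads.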
   
   Thanks,
   Al

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
