@zheng-da With very large batches, I sometimes observe NaN values at the 
first training iteration. The problem occurs much more frequently when I use 
more OMP threads and a large network. For example, with more than 20 OMP 
threads, VGG-16, and 1024 samples per batch (on a single node), I get NaNs 
about 10% of the time.
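For reference, this is roughly how I detect the bad iteration: after the first forward/backward pass I scan every parameter/gradient array for non-finite values. A minimal sketch (pure numpy; in my actual runs the arrays come out of MXNet via `.asnumpy()`, and the helper name is my own, not an MXNet API):

```python
import numpy as np

def first_nonfinite_param(params):
    """Return the name of the first array containing NaN/Inf, else None.

    `params` maps parameter names to numpy arrays. Hypothetical helper,
    just to illustrate how I flag the failing iteration.
    """
    for name, arr in params.items():
        if not np.all(np.isfinite(arr)):
            return name
    return None
```
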

I suspect this could be a race condition when allocating/copying MKLDNN 
memory. Does that hypothesis seem plausible to you? Do you know which 
functions I should monitor to find the root cause?
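One thing I am trying in the meantime: rerunning the failing configuration with MKL-DNN's verbose tracing turned on, so each primitive execution is logged and a clean run can be diffed against a NaN run. A sketch of the setup (the training script name is a placeholder for my own command, not something from the repo):

```shell
# MKLDNN_VERBOSE=1 makes MKL-DNN log every primitive it executes
export MKLDNN_VERBOSE=1
# pin the thread count that triggers the failure for me
export OMP_NUM_THREADS=20
# ...then rerun the usual training command, e.g. (placeholder script name):
# python train_vgg16.py --batch-size 1024
```
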

Thanks,
Al

[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/10865 ]