@zheng-da Sometimes, when using very large batches, I observe NaN values at the first training iteration. The phenomenon is much more frequent when I use more OMP threads and when the network is large. For example, with more than 20 OMP threads, VGG-16, and 1024 samples per batch (on a single node), I get NaNs about 10% of the time.
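Since the failure rate rises with the thread count, a quick way to test the race hypothesis is to rerun the exact same job with OMP parallelism disabled and see whether the NaNs vanish. This is a generic sketch; the training script name below is hypothetical:

```shell
# If the NaNs come from a race between OMP threads, forcing a single
# thread should make the first-iteration NaNs disappear entirely.
export OMP_NUM_THREADS=1
# Then rerun the same training job, e.g.:
#   python train_vgg16.py --batch-size 1024   # script name is hypothetical
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

If the run is clean with one thread but fails intermittently with many, that strengthens the race-condition hypothesis without yet pinpointing the racing code.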
I think this could be due to a race condition when allocating/copying MKLDNN memory. Does that sound plausible? Do you know which functions I should monitor to find the root of the problem? Thanks, Al

[ Full content available at: https://github.com/apache/incubator-mxnet/issues/10865 ]
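To narrow down *where* the corruption first appears, one option is to scan each operator's output for NaNs during the first iteration and stop at the first offender. The helper below is a framework-agnostic sketch (not MXNet or MKLDNN API code); the function name and the idea of calling it per layer are assumptions for illustration:

```python
import math

def find_first_nan(values, label="output"):
    """Scan a flat sequence of floats and report the first NaN, if any.

    Returns the index of the first NaN, or None when the data is clean.
    Calling a hook like this on each layer's output during the first
    iteration can localize which operator introduces the NaNs.
    """
    for i, v in enumerate(values):
        if math.isnan(v):
            print(f"NaN detected in {label} at flat index {i}")
            return i
    return None
```

For example, `find_first_nan([0.5, float("nan"), 1.0])` reports index 1, while a clean sequence returns `None`. Once the first NaN-producing operator is known, its MKLDNN memory allocation/copy path is the natural place to look for the suspected race.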
