al-rigazzi commented on issue #10865: A potential race condition in the executor or engine.
URL: https://github.com/apache/incubator-mxnet/issues/10865#issuecomment-424668146

@zheng-da Sometimes, when using very large batches, I observe NaN values at the first training iteration. This happens much more often when I use more OMP threads and the network is large. For example, with more than 20 OMP threads, VGG-16, and 1024 samples per batch (on a single node), I get NaNs 10% of the time. I think this could be due to a race condition when allocating or copying MKLDNN memory. Does that make sense? Do you know which functions I should try to monitor to find the root of the problem?

Thanks,
Al
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
