[GitHub] al-rigazzi commented on issue #10865: A potential race condition in the executor or engine.

GitBox Wed, 26 Sep 2018 03:36:45 -0700

al-rigazzi commented on issue #10865: A potential race condition in the 
executor or engine.
URL: 
https://github.com/apache/incubator-mxnet/issues/10865#issuecomment-424668146
 
 
   @zheng-da sometimes, with very large batches, I observe NaN values at the 
first training iteration. This happens much more often when I use more 
OMP threads and the network is large. For example, with more than 20 OMP 
threads, VGG-16, and 1024 samples per batch (on a single node), I get NaNs 
10% of the time.
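   
   For anyone trying to measure how often this happens, here is a minimal 
sketch of a harness that repeats the first iteration and reports the fraction 
of runs containing a NaN. It uses only the Python standard library. 
`fake_first_iteration` is a hypothetical stand-in that injects NaNs at random; 
it is not MXNet code and would need to be replaced with a function that runs 
one real training step and returns the outputs as floats.

```python
import math
import random
from typing import Callable, Iterable, List


def has_nan(values: Iterable[float]) -> bool:
    """Return True if any value in the iterable is NaN."""
    return any(math.isnan(v) for v in values)


def nan_rate(first_iteration: Callable[[], Iterable[float]], trials: int) -> float:
    """Run `first_iteration` `trials` times; return the fraction of runs with a NaN."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    failures = sum(1 for _ in range(trials) if has_nan(first_iteration()))
    return failures / trials


def fake_first_iteration(rng: random.Random, p_nan: float = 0.1) -> List[float]:
    """Hypothetical stand-in for one training step (assumption, not MXNet).

    Returns eight random outputs and, with probability `p_nan`, replaces
    the first one with NaN to imitate an intermittent failure.
    """
    outputs = [rng.random() for _ in range(8)]
    if rng.random() < p_nan:
        outputs[0] = float("nan")
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulated runs are repeatable
    rate = nan_rate(lambda: fake_first_iteration(rng), trials=1000)
    print(f"NaN observed in {rate:.1%} of first iterations")
```

   Running the same harness at several OMP thread counts would show whether 
the failure rate really grows with the number of threads.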
   
   I think this could be due to a race condition when allocating or copying 
MKLDNN memory. Does that make sense to you? Do you know which functions I 
should try to monitor to find the root of the problem?
   
   Thanks,
   Al


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
