nswamy commented on issue #12768: Disabled: 
test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
URL: https://github.com/apache/incubator-mxnet/pull/12768#issuecomment-431120492
 
 
   I debugged a similar test a week ago. The issue does not arise (at least for 
the test I ran) when you run it standalone, however many times. It would crash 
only when you ran the entire test suite, and then it failed about 1 in 10 times. 
   
   There are a couple of possibilities (from my findings): 
   1) There is a leak in the Nvidia drivers. 
   2) There is a leak in the CUDA code that accumulates within the process as 
all the tests run, and eventually throws the error. This is the most likely cause. 
   
   I ran all the GPU tests over the weekend (outside Docker) and found that 
out of 100 runs, 10 hit this error, so I tend to believe there is a 
memory leak in the CUDA code. 
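
   A repeated run like this could be scripted roughly as follows. This is only 
a minimal sketch, not the exact setup used for the numbers above: the helper 
name `count_failures`, the test command, and the run count are all assumptions. 

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times in fresh processes; return how many exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Each run is a new process, so a leak that builds up inside one
        # process only shows if `cmd` itself runs the whole suite.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Hypothetical invocation (runner and test directory are assumptions):
# failed = count_failures([sys.executable, "-m", "pytest", "tests/python/gpu"], runs=100)
# print(f"{failed} of 100 runs failed")
```

   Running the single test standalone through the same helper would give the 
comparison described earlier: if it never fails alone but the full suite does, 
that points at state accumulating within the process. 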

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
