nswamy commented on issue #12768: Disabled: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
URL: https://github.com/apache/incubator-mxnet/pull/12768#issuecomment-431120492

I debugged a similar test a week ago. The issue does not arise (at least for the test I ran) when you run the test standalone, however many times you run it. It crashed only when I ran the entire test suite, and then only in about 1 out of 10 runs.

From my findings, there are a couple of possibilities:
1) There is a leak in the Nvidia drivers.
2) There is a leak in the CUDA code that accumulates within the process as all the tests run, and eventually throws the error. This is the most likely one.

I ran all the GPU tests over the weekend (outside Docker) and got this error in 10 out of 100 runs, so I tend to believe there is a memory leak in the CUDA code.
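
For anyone who wants to reproduce the failure rate, repeated runs of the full suite can be tallied with a small harness along these lines. This is only a sketch, not the script used for the numbers above: `count_failures` is a hypothetical helper name, and the command to run is left to the caller because the exact invocation of the GPU suite is not given here.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited non-zero.

    `cmd` is an argument list as accepted by subprocess.run. It is an
    assumption of this sketch that the test runner signals a crash or
    test failure through a non-zero exit code.
    """
    failures = 0
    for _ in range(runs):
        # Each run is a fresh process, so anything leaked during one
        # full-suite run cannot carry over into the next run.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling `count_failures(suite_cmd, 100)` with `suite_cmd` set to whatever command launches the whole GPU suite gives a count comparable to the 10 out of 100 above. Running the same harness on the single test alone should give the standalone rate for comparison.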
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
