[GitHub] [incubator-mxnet] khui commented on issue #14799: training model failed after one epoch on GPU: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED

GitBox Mon, 29 Apr 2019 11:33:36 -0700

khui commented on issue #14799: training model failed after one epoch on GPU: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED
URL: https://github.com/apache/incubator-mxnet/issues/14799#issuecomment-487693286
 
 
   @mirocody Thanks!
   
   The errors appear when I am using DLAMI. To debug, I ran a docker container
   to rule out a mismatched mxnet/cuda/cudnn version as the cause of the bug.
   Since that cause seems unlikely after trying different combinations, I
   switched back to using DLAMI. The container was run with the following
   command, after which commands were run inside the container as usual:
   `nvidia-docker run --rm -it --name gpu_run -v /home/ec2-user/workspace/output:/workdir/output mxnet_gpu bash`
   
   The Jupyter notebook was run in the mxnet_p36 environment, as suggested by
   @lanking520.
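
   When comparing environments (the container versus mxnet_p36), it can help to
   record what each one actually reports. The following is only a minimal
   sketch: the helper name `report_versions` is hypothetical, it assumes
   `nvidia-smi` is on the PATH when a GPU driver is present, and it reports the
   mxnet version and GPU driver version only, not the cuDNN version.

```python
import importlib
import shutil
import subprocess


def report_versions():
    """Collect basic version info for comparing two environments.

    Hypothetical helper for illustration; it does not detect cuDNN.
    """
    info = {}

    # mxnet version, or None if mxnet cannot be imported here.
    try:
        mx = importlib.import_module("mxnet")
        info["mxnet"] = mx.__version__
    except (ImportError, OSError):
        info["mxnet"] = None

    # GPU name and driver version as reported by nvidia-smi, if available.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        info["gpu"] = result.stdout.strip()
    else:
        info["gpu"] = None

    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(key, value)
```

   Running the same script inside the container and in the notebook
   environment gives two outputs that can be compared line by line.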


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
