Hey Kellen, I really appreciate that. Thank you!
And thanks to the community for supporting me ^^

Per

On Wed, Apr 10, 2019 at 5:53 AM kellen sunderland <[email protected]> wrote:

> Hey Per, just wanted to drop a line and say thanks for supporting the
> community on this one.
>
> On Tue, Apr 9, 2019 at 4:20 AM Per da Silva <[email protected]> wrote:
>
> > I've created an issue to track this problem:
> > https://github.com/apache/incubator-mxnet/issues/14652
> >
> > On Tue, Apr 9, 2019 at 9:07 AM Per da Silva <[email protected]> wrote:
> >
> > > Dear MXNet community,
> > >
> > > I've been trying to update the CI GPU images to CUDA 10, but the
> > > tests are failing. I'm not sure why and would really appreciate
> > > some help =D
> > >
> > > I've managed, at least, to narrow the problem down to the cuDNN
> > > version. The current CUDA 10 image uses cuDNN version 7.5.0.56 (
> > > https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
> > > ).
> > >
> > > I noticed that the binary in the Python packages we release uses
> > > cuDNN 7.3.1.20 (
> > > https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34
> > > ), so I decided to create a PR with CI updated to CUDA 10 with
> > > cuDNN 7.3.1.20, and sure enough the tests passed (
> > > https://github.com/apache/incubator-mxnet/pull/14513).
> > >
> > > After talking with another contributor, we decided that I would try
> > > to create a PR with CUDA 10 and cuDNN 7.5 and just disable the
> > > failing tests (to be fixed later). But it seems the problem is a
> > > bit more heinous: I disable one test, and another one fails... So
> > > it might make sense to reach out now and see if we can find the
> > > root cause and fix it.
> > >
> > > Some things I've sanity checked:
> > >
> > > We run the tests on g3.8xlarge instances. These instances contain
> > > Tesla M60 GPUs. The Tesla M60s have a compute capability of 5.2.
> > > CUDA 10 supports compute capabilities of 3.0 - 7.5
> > > (https://en.wikipedia.org/wiki/CUDA).
> > >
> > > According to the cuDNN support matrix (
> > > https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html
> > > ), cuDNN 7.5 is compatible with the GPU and CUDA 10, and requires
> > > driver r410.48 (I assume greater or equal).
> > >
> > > The AMIs running on the g3.8xlarge instances have CUDA 10 and
> > > driver 410.73.
> > >
> > > So, as best I can tell, our environment ought to support cuDNN 7.5,
> > > which leads me to conclude that maybe there's something wrong in
> > > the code.
> > >
> > > The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check
> > > failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN:
> > > CUDNN_STATUS_ARCH_MISMATCH".
> > >
> > > According to the cuDNN user guide (
> > > https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
> > > ):
> > >
> > > CUDNN_STATUS_ARCH_MISMATCH
> > >
> > > The function requires a feature absent from the current GPU device.
> > > Note that cuDNN only supports devices with compute capabilities
> > > greater than or equal to 3.0.
> > >
> > > To correct: compile and run the application on a device with
> > > appropriate compute capability.
> > >
> > > But, as we've seen, our environment seems to support this version
> > > of cuDNN, and other versions go through CI without any problem...
> > >
> > > You can see some logs here:
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
> > >
> > > I have about 13 runs of this pipeline. The errors for different
> > > runs can be seen by changing the number before /pipeline (e.g.
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
> > > for the 2nd run, etc.)
> > >
> > > Thanks in advance for the help!
> > >
> > > You can reach me here or on Slack if you have any questions =D
> > >
> > > Cheers,
> > >
> > > Per
> > >
> > > P.S. I'm attaching some instructions on how to reproduce the issue
> > > at home (or at least on a g3.8xlarge instance running Ubuntu
> > > 16.04).
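The version sanity checks described in the thread can be sketched as a small stand-alone script. The helper name is mine, not part of MXNet; only the version numbers and ranges come from the email above.

```python
# Stand-alone sketch of the sanity checks from the thread. The helper
# name is hypothetical; the version numbers are those quoted above.

def parse_version(s):
    """Turn a dotted version string like '7.5.0.56' into a tuple of ints."""
    return tuple(int(part) for part in s.split("."))

# Driver check: cuDNN 7.5 needs driver >= r410.48; the AMIs run 410.73.
assert parse_version("410.73") >= parse_version("410.48")

# Compute-capability check: CUDA 10 supports 3.0 - 7.5; the M60 is 5.2.
assert parse_version("3.0") <= parse_version("5.2") <= parse_version("7.5")

# cuDNN on the failing image (7.5.0.56) vs. the released binaries (7.3.1.20).
assert parse_version("7.5.0.56") > parse_version("7.3.1.20")

print("all version sanity checks pass")
```

As the email argues, every check passes, which is why the environment looks like it should support cuDNN 7.5.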

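For anyone grepping the Jenkins logs, the "(6 vs. 0)" in the error is the raw cudnnStatus_t value compared against CUDNN_STATUS_SUCCESS (0). A minimal decoder follows; the log itself confirms 6 = CUDNN_STATUS_ARCH_MISMATCH, while the other entries assume the enum ordering of the cuDNN 7.x cudnn.h header and are worth re-checking against the header on the CI image.

```python
# Decoder for the numeric status in "Check failed: e ==
# CUDNN_STATUS_SUCCESS (6 vs. 0)". The 0 and 6 entries are confirmed
# by the log message itself; the rest assume the cudnnStatus_t
# ordering in the cuDNN 7.x headers.

CUDNN_STATUS = {
    0: "CUDNN_STATUS_SUCCESS",
    1: "CUDNN_STATUS_NOT_INITIALIZED",
    2: "CUDNN_STATUS_ALLOC_FAILED",
    3: "CUDNN_STATUS_BAD_PARAM",
    4: "CUDNN_STATUS_INTERNAL_ERROR",
    5: "CUDNN_STATUS_INVALID_VALUE",
    6: "CUDNN_STATUS_ARCH_MISMATCH",
    7: "CUDNN_STATUS_MAPPING_ERROR",
    8: "CUDNN_STATUS_EXECUTION_FAILED",
    9: "CUDNN_STATUS_NOT_SUPPORTED",
}

print(CUDNN_STATUS[6])  # -> CUDNN_STATUS_ARCH_MISMATCH
```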