[GitHub] [incubator-mxnet] larroy opened a new issue #16951: CentOS GPU tests failing in master

GitBox Fri, 29 Nov 2019 23:06:11 -0800

larroy opened a new issue #16951: CentOS GPU tests failing in master
URL: https://github.com/apache/incubator-mxnet/issues/16951
 
 
   ## Description
   
   CentOS GPU tests are failing in master:
   
   
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/master/1341/
   
   
   I couldn't reproduce this on a p3 instance running Ubuntu 18.04. Trying on the CI AMI now.
   
   This seems to be a problem in the base AMI. It reproduces when running the following commands:
   
   ```
   time ci/build.py --docker-registry mxnetci --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_centos7_gpu
   time ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu
   
   ```
   
   Failure is:
   
   ```
   [07:03:53] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [07:03:59] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:107: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
   Stack trace:
     [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f0376aa865b]
     [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0x227) [0x7f037aa308e7]
     [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x244) [0x7f037aa30e14]
     [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x19f) [0x7f037aa513ef]
     [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7f037aa51626]
     [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x44) [0x7f037aa3d1c4]
     [bt] (6) /usr/lib64/libstdc++.so.6(+0xb5070) [0x7f03e2478070]
     [bt] (7) /usr/lib64/libpthread.so.0(+0x7e65) [0x7f03f4f92e65]
     [bt] (8) /usr/lib64/libc.so.6(clone+0x6d) [0x7f03f45b288d]
   
   
   /work/runtime_functions.sh: line 1312:     6 Aborted                 (core dumped) python3.6 -m "nose" $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
   2019-11-30 07:03:59,955 - root - INFO - Waiting for status of container ea33d765417a for 600 s.
   2019-11-30 07:04:00,117 - root - INFO - Container exit status: {'StatusCode': 134, 'Error': None}
   2019-11-30 07:04:00,117 - root - ERROR - Container exited with an error 😞
   2019-11-30 07:04:00,117 - root - INFO - Executed command for reproduction:
   
   ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu
   
   2019-11-30 07:04:00,117 - root - INFO - Stopping container: ea33d765417a
   2019-11-30 07:04:00,119 - root - INFO - Removing container: ea33d765417a
   2019-11-30 07:04:00,140 - root - CRITICAL - Execution of ['/work/runtime_functions.sh', 'unittest_centos7_gpu'] failed with status: 134
   
   ```
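
As a reading aid for the log above, here is a minimal Python sketch that decodes the two numeric codes it contains: the cuBLAS status `7` from the failed check and the container exit status `134`. The helper names `describe_cublas_status` and `describe_exit_status` are hypothetical and not part of MXNet or its CI scripts. The status table is copied from the `cublasStatus_t` enum as I understand it from the cuBLAS headers, so it is worth checking against the installed CUDA version. The exit-status decoding assumes the usual shell convention that a status above 128 means the process was killed by signal `status - 128`.

```python
import signal

# Values of the cublasStatus_t enum (assumed from cublas_api.h; verify
# against the CUDA toolkit actually installed in the image).
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def describe_cublas_status(code):
    """Hypothetical helper: name a numeric cuBLAS status code."""
    return CUBLAS_STATUS.get(code, "unknown cuBLAS status")


def describe_exit_status(status_code):
    """Hypothetical helper: return the signal name implied by an exit
    status above 128, or None if the status does not map to a signal."""
    if status_code <= 128:
        return None
    try:
        return signal.Signals(status_code - 128).name
    except ValueError:
        return None


if __name__ == "__main__":
    # "(7 vs. 0)" in the failed check is the cuBLAS status returned.
    print(describe_cublas_status(7))
    # "'StatusCode': 134" is the container exit status.
    print(describe_exit_status(134))
```

Under these assumptions, status `7` reads as `CUBLAS_STATUS_INVALID_VALUE`, and `134` is `128 + 6`, that is, signal 6 (`SIGABRT`). The latter is consistent with the `Aborted (core dumped)` line and with `terminate called` in the log.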
   
   A possible solution would be to update the AMI.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
