zhechen opened a new issue #9823: RCNN example fails with latest mxnet
URL: https://github.com/apache/incubator-mxnet/issues/9823
 
 
   I am using mxnet built with CUDA 9 and cuDNN 7, with distributed training enabled.
However, when I re-run the rcnn code in the example, I get the following error:
   
   Traceback (most recent call last):
     File "train_end2end.py", line 199, in <module>
       main()
     File "train_end2end.py", line 196, in main
       lr=args.lr, lr_step=args.lr_step)
     File "train_end2end.py", line 158, in train_net
       arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
     File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
       self.update_metric(eval_metric, data_batch.label)
     File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
       self._curr_module.update_metric(eval_metric, labels)
     File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
       self._exec_group.update_metric(eval_metric, labels)
     File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
       eval_metric.update_dict(labels_, preds)
     File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
       metric.update_dict(labels, preds)
     File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
       self.update(label, pred)
     File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
       pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
     File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM
   
   Stack trace returned 10 entries:
   [bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
   [bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
   [bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
   [bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xd4c) [0x2adc0f5c2eac]
   [bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
   [bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
   [bt] (6) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2c4) [0x2adc0ec2fcd4]
   [bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent> const&)+0x103) [0x2adc0ec34253]
   [bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)+0x3e) [0x2adc0ec3448e]
   [bt] (9) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x3b) [0x2adc0ec2e36b]
   
   
   Can anyone help me with it? Thanks very much!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
