zhechen opened a new issue #9823: RCNN example fails for using latest mxnet
URL: https://github.com/apache/incubator-mxnet/issues/9823

I am using mxnet with CUDA9 + CUDNN7 and distributed training enabled. However, when I re-run the rcnn code in the example, I got the following error:

Traceback (most recent call last):
  File "train_end2end.py", line 199, in <module>
    main()
  File "train_end2end.py", line 196, in main
    lr=args.lr, lr_step=args.lr_step)
  File "train_end2end.py", line 158, in train_net
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
    self._curr_module.update_metric(eval_metric, labels)
  File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
    self._exec_group.update_metric(eval_metric, labels)
  File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
    eval_metric.update_dict(labels_, preds)
  File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
    metric.update_dict(labels, preds)
  File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
    self.update(label, pred)
  File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
    pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
  File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
[bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
[bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
[bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xd4c) [0x2adc0f5c2eac]
[bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
[bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
[bt] (6) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2c4) [0x2adc0ec2fcd4]
[bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent> const&)+0x103) [0x2adc0ec34253]
[bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)+0x3e) [0x2adc0ec3448e]
[bt] (9) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x3b) [0x2adc0ec2e36b]

Can anyone help me with it? Thanks very much!