MrRaghav commented on pull request #16487: URL: https://github.com/apache/incubator-mxnet/pull/16487#issuecomment-653756605
Hello, I am still getting this error while using MXNet with Sockeye. Since this was reported as fixed in the new release, I didn't open a new bug. Please find the details in the following points:

1) I'm using the following versions of MXNet and Sockeye (2.1.7, on CUDA 10.1):

```
[username]@[server]:~/username/sockeye/dir1$ pip3 list | grep mxnet
mxnet            1.6.0
mxnet-cu101mkl   1.6.0
mxnet-mkl        1.6.0
[username]@[server]:~/username/sockeye/dir1$ pip3 list | grep sockeye
sockeye          2.1.7
```

2) When I run the **sockeye.train** command with arguments, I get the following log:

```
[username]@[server]:~/username/sockeye$ tail -30 77233.out
  File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 997, in <module>
    main()
  File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 764, in main
    train(args)
  File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 992, in train
    training_state = trainer.fit(train_iter=train_iter, validation_iter=eval_iter, checkpoint_decoder=cp_decoder)
  File "/home/username/.local/lib/python3.7/site-packages/sockeye/training.py", line 242, in fit
    self._step(batch=train_iter.next())
  File "/home/username/.local/lib/python3.7/site-packages/sockeye/training.py", line 346, in _step
    loss_func.metric.update(loss_value.asscalar(), num_samples.asscalar())
  File "/home/username/.local/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2553, in asscalar
    return self.asnumpy()[0]
  File "/home/username/.local/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2535, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/username/.local/lib/python3.7/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:58:26] src/storage/./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory
Stack trace:
  [bt] (0) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6d554b) [0x7f6c5b3d054b]
  [bt] (1) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x41a0c72) [0x7f6c5ee9bc72]
  [bt] (2) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x41a694f) [0x7f6c5eea194f]
  [bt] (3) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3972e10) [0x7f6c5e66de10]
  [bt] (4) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x39730c7) [0x7f6c5e66e0c7]
  [bt] (5) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x281) [0x7f6c5e66e4d1]
  [bt] (6) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3896f19) [0x7f6c5e591f19]
  [bt] (7) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a3c31) [0x7f6c5e59ec31]
  [bt] (8) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a7170) [0x7f6c5e5a2170]
```
The output also contains the following message:

**learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.**

3) I can also see the error "cudaMalloc retry failed: out of memory" in the above log, and I checked https://github.com/deepinsight/insightface/issues/257 for a fix. They mention that reducing the batch size solves the issue, but I am not passing any such argument to **sockeye.train** (see the command sketch after point 6).

4) The arguments used with Sockeye are as follows:

```
python3 -m sockeye.train -d training_data \
        -vs dev.BPE.de \
        -vt dev.BPE.en \
        --shared-vocab \
        -o parallel/wmt_model
```

5) I found the code at https://mxnet.apache.org/api/python/docs/_modules/mxnet/optimizer/optimizer.html, which says that `learning_rate` should be assigned to `self.lr_scheduler.base_lr` together with the above warning (see the simplified sketch at the end of this comment). But in my case it appears alongside an error, and the run ends as failed.

6) Moreover, I checked the MXNet 1.6.0 release notes at the link below and can see that this issue has been fixed:
https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes#id-1.6.0Releasenotes-Bugfixes

I hope I didn't miss anything before coming to you. Can you please suggest what should be done in such a scenario?
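For point 3, my understanding of the batch-size workaround is roughly the following variant of my command from point 4. This is only a sketch: I have not confirmed the exact flag name or a sensible value for my GPU, so it should be checked against `python3 -m sockeye.train --help` for this Sockeye version.

```
# Hypothetical variant of the command from point 4, adding an explicit batch size
# to lower peak GPU memory; the value 1024 is an arbitrary example, not a recommendation.
python3 -m sockeye.train -d training_data \
        -vs dev.BPE.de \
        -vt dev.BPE.en \
        --shared-vocab \
        --batch-size 1024 \
        -o parallel/wmt_model
```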
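And for point 5, my reading of the optimizer code linked there is roughly the following simplified sketch (paraphrased from my understanding, not the actual MXNet source), which is why I expected only a warning rather than a failed run:

```python
import logging

class OptimizerSketch:
    """Paraphrased sketch of the behaviour described in point 5,
    not the real mxnet.optimizer.Optimizer."""

    def __init__(self, learning_rate=0.01, lr_scheduler=None):
        self.lr = learning_rate
        self.lr_scheduler = lr_scheduler
        if lr_scheduler is not None:
            # The scheduler's base learning rate is overwritten and only a
            # warning is emitted; the actual failure in my log appears to be
            # the cudaMalloc out-of-memory error, not this message.
            logging.warning("learning rate from lr_scheduler has been "
                            "overwritten by learning_rate in optimizer.")
            self.lr_scheduler.base_lr = learning_rate
```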