MrRaghav commented on pull request #16487:
URL: https://github.com/apache/incubator-mxnet/pull/16487#issuecomment-653756605


   Hello,
   I am still getting this error while using MXNet with Sockeye. Since it was reported as fixed in the new release, I didn't open a new bug.
   Please find the details in the following points:
   
   1) I'm using the following versions of MXNet, together with Sockeye 2.1.7 (on CUDA 10.1):
       [username]@[server]:~/username/sockeye/dir1$ pip3 list | grep mxnet
       mxnet               1.6.0
       **mxnet-cu101mkl      1.6.0**
       mxnet-mkl           1.6.0
       [username]@[server]:~/username/sockeye/dir1$ pip3 list | grep sockeye
       **sockeye             2.1.7**
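
   For completeness, here is a small sanity check (my own sketch, not something Sockeye requires) to confirm which of the installed MXNet builds Python actually loads and that a GPU context works:

       import mxnet as mx

       # Shows the version actually imported and whether a CUDA device is usable;
       # with several mxnet variants installed, this reveals which one gets loaded.
       print(mx.__version__)
       print(mx.context.num_gpus())            # number of visible CUDA devices
       a = mx.nd.ones((2, 2), ctx=mx.gpu(0))   # enqueued asynchronously
       print(a.asnumpy())                      # synchronizes; fails here if the loaded build cannot use the GPU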
   
   2) When I run the **sockeye.train** command with arguments, I get the following log (a note on where MXNet raises this error follows the log):
   
      _[username]@[server]:~/username/sockeye$ tail -30 77233.out
     File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 997, in <module>
       main()
     File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 764, in main
       train(args)
     File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 992, in train
       training_state = trainer.fit(train_iter=train_iter, validation_iter=eval_iter, checkpoint_decoder=cp_decoder)
     File "/home/username/.local/lib/python3.7/site-packages/sockeye/training.py", line 242, in fit
       self._step(batch=train_iter.next())
     File "/home/username/.local/lib/python3.7/site-packages/sockeye/training.py", line 346, in _step
       loss_func.metric.update(loss_value.asscalar(), num_samples.asscalar())
     File "/home/username/.local/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2553, in asscalar
       return self.asnumpy()[0]
     File "/home/username/.local/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2535, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/home/username/.local/lib/python3.7/site-packages/mxnet/base.py", line 255, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   **mxnet.base.MXNetError: [09:58:26] src/storage/./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory**
   Stack trace:
     [bt] (0) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6d554b) [0x7f6c5b3d054b]
     [bt] (1) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x41a0c72) [0x7f6c5ee9bc72]
     [bt] (2) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x41a694f) [0x7f6c5eea194f]
     [bt] (3) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3972e10) [0x7f6c5e66de10]
     [bt] (4) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x39730c7) [0x7f6c5e66e0c7]
     [bt] (5) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x281) [0x7f6c5e66e4d1]
     [bt] (6) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3896f19) [0x7f6c5e591f19]
     [bt] (7) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a3c31) [0x7f6c5e59ec31]
     [bt] (8) /home/username/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x38a7170) [0x7f6c5e5a2170]_
   
   
     **learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.**
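
   As far as I understand, MXNet runs GPU operations asynchronously, so an out-of-memory failure from an earlier operation is raised at the next blocking call such as asscalar()/asnumpy(); here is a minimal illustration of that behaviour (my own sketch, not Sockeye code):

       import mxnet as mx

       # Illustration only: operations on a GPU NDArray are queued asynchronously,
       # and errors (e.g. cudaMalloc out of memory) are reported at the next
       # blocking call -- which is why the traceback above points at asscalar()
       # inside metric.update() rather than at the operation that ran out of memory.
       ctx = mx.gpu(0)
       x = mx.nd.ones((1,), ctx=ctx)   # enqueued; no error surfaces here yet
       y = x * 2                       # still asynchronous
       print(y.asscalar())             # blocking sync point: pending GPU errors are raised here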
   
   
   3) Also, I can see another error, "cudaMalloc retry failed: out of memory", in the above log, and I checked https://github.com/deepinsight/insightface/issues/257 for a fix. They mention that reducing the batch size solves the issue, but I am not passing any such argument to **sockeye.train** (a possible adjustment is sketched after point 4 below).
   
   4) The arguments used with sockeye are as follows:
       _python3 -m sockeye.train -d training_data \
                           -vs dev.BPE.de \
                           -vt dev.BPE.en \
                           --shared-vocab \
                           -o parallel/wmt_model_
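
   Regarding point 3, I assume a smaller batch size would be passed roughly like this (assuming --batch-size is the relevant sockeye.train option in this version; the value 2048 is only illustrative):

        python3 -m sockeye.train -d training_data \
                            -vs dev.BPE.de \
                            -vt dev.BPE.en \
                            --shared-vocab \
                            --batch-size 2048 \
                            -o parallel/wmt_model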
   
   5) I found the optimizer code at https://mxnet.apache.org/api/python/docs/_modules/mxnet/optimizer/optimizer.html, which shows that learning_rate is assigned to self.lr_scheduler.base_lr and the above warning is printed. However, in my case the warning appears together with the failure and the run ends as failed; my rough paraphrase of that code follows below.
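
   A hedged paraphrase of what that constructor appears to do (illustrative names, not the exact MXNet source):

       import warnings

       class OptimizerSketch:
           """Illustrative paraphrase of the linked Optimizer constructor."""
           def __init__(self, learning_rate=0.01, lr_scheduler=None):
               self.lr = learning_rate
               self.lr_scheduler = lr_scheduler
               if lr_scheduler is not None:
                   # The quoted message is emitted as a warning, not an exception,
                   # while the scheduler's base learning rate is overwritten.
                   warnings.warn('learning rate from ``lr_scheduler`` has been '
                                 'overwritten by ``learning_rate`` in optimizer.')
                   self.lr_scheduler.base_lr = learning_rate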
       
   6) Moreover, I checked the release notes of MXNet 1.6.0 at the link below, and they indicate that this issue has been fixed:
       
https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes#id-1.6.0Releasenotes-Bugfixes
   
   I hope I didn't miss anything before coming to you. Can you please suggest what should be done in this scenario?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

