karan6181 commented on issue #19631:
URL: https://github.com/apache/incubator-mxnet/issues/19631#issuecomment-747128363


   **Update:** Commenting out these lines
(https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py#L705-L710)
seems to work with Horovod `v0.21.0`, `mxnet-cu101==1.7.0`, and
`gluoncv==0.8.0`. However, running the same script without Horovod fails with
a different error, shown below:
   
   ```
   INFO:root:[Epoch 0 Iteration 0] Set learning rate to 1e-05
   [00:26:11] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
   [00:26:12] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
   Exception in thread Thread-7:
   Traceback (most recent call last):
     File "/shared/mx_oob_env/lib/python3.8/threading.py", line 932, in _bootstrap_inner
       self.run()
     File "/shared/mx_oob_env/lib/python3.8/threading.py", line 870, in run
       self._target(*self._args, **self._kwargs)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/utils/parallel.py", line 105, in _worker
       out = parallel.forward_backward(x)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/mask_rcnn/data_parallel.py", line 48, in forward_backward
       cls_targets, box_targets, box_masks, indices = self.net(data, gt_box, gt_label)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 747, in __call__
       out = self.forward(*args)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1309, in forward
       return self._call_cached_op(x, *args)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1093, in _call_cached_op
       out = self._cached_op(*cargs)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/_ctypes/ndarray.py", line 148, in __call__
       check_call(_LIB.MXInvokeCachedOpEx(
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
       raise get_last_ffi_error()
   mxnet.base.MXNetError: Traceback (most recent call last):
     File "src/imperative/cached_op.cc", line 777
   MXNetError: Check failed: inputs[i]->ctx() == default_ctx (gpu(0) vs. gpu(1)) : CachedOp requires all inputs to live on the same context. But data0 is on gpu(1) while maskrcnn0_normalizedperclassboxcenterencoder0_means is on gpu(0)
   ```
   
   - **Conclusion:** Manually casting the model to FP16 does not work with
MXNet-cu101 1.7.0, but it does work with MXNet-cu101mkl 1.6.0.




