karan6181 commented on issue #19631: URL: https://github.com/apache/incubator-mxnet/issues/19631#issuecomment-747128363
**Update:** Commenting out these lines (https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py#L705-L710) seems to work with Horovod `v0.21.0`, `mxnet-cu101==1.7.0` and `gluoncv==0.8.0`. However, running the same script without Horovod fails with a different error, shown below:

```
INFO:root:[Epoch 0 Iteration 0] Set learning rate to 1e-05
[00:26:11] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
[00:26:12] src/imperative/./cached_op.h:257: Disabling fusion due to altered topological order of inputs.
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/shared/mx_oob_env/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/shared/mx_oob_env/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/utils/parallel.py", line 105, in _worker
    out = parallel.forward_backward(x)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/mask_rcnn/data_parallel.py", line 48, in forward_backward
    cls_targets, box_targets, box_masks, indices = self.net(data, gt_box, gt_label)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 747, in __call__
    out = self.forward(*args)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1309, in forward
    return self._call_cached_op(x, *args)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 1093, in _call_cached_op
    out = self._cached_op(*cargs)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/_ctypes/ndarray.py", line 148, in __call__
    check_call(_LIB.MXInvokeCachedOpEx(
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/imperative/cached_op.cc", line 777
MXNetError: Check failed: inputs[i]->ctx() == default_ctx (gpu(0) vs. gpu(1)) : CachedOp requires all inputs to live on the same context. But data0 is on gpu(1) while maskrcnn0_normalizedperclassboxcenterencoder0_means is on gpu(0)
```

- **Conclusion:** Manually casting the model to FP16 does not work with MXNet-cu101 1.7.0; however, it does work with MXNet-cu101mkl 1.6.0.
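
The failed check reports `data0` on `gpu(1)` while `maskrcnn0_normalizedperclassboxcenterencoder0_means` is on `gpu(0)`. One way to narrow this down might be to list which parameters are missing from any of the training contexts. The helper below is only a sketch of that idea: `find_params_missing_ctx` is a hypothetical name, and it assumes each parameter object exposes `list_ctx()` the way Gluon `Parameter` objects do. Whether this parameter really lives only on `gpu(0)` is an assumption, not something confirmed in this issue.

```python
def find_params_missing_ctx(params, expected_ctxs):
    """Return {name: [missing contexts]} for parameters that are not
    present on every expected context.

    `params` is any mapping of name -> object with a `list_ctx()` method
    (assumption: something like the result of `net.collect_params()`).
    """
    missing = {}
    for name, param in params.items():
        try:
            present = list(param.list_ctx())
        except RuntimeError:
            # Gluon raises RuntimeError for parameters that are not initialized yet;
            # treat them as present on no context.
            present = []
        absent = [ctx for ctx in expected_ctxs if ctx not in present]
        if absent:
            missing[name] = absent
    return missing


# Hypothetical usage inside the training script (not run here):
#   import mxnet as mx
#   ctx = [mx.gpu(0), mx.gpu(1)]
#   print(find_params_missing_ctx(net.collect_params(), ctx))
```

If the encoder's `means` parameter shows up as missing on `gpu(1)`, that would be consistent with the error above; it would not by itself explain why the behaviour differs between 1.6.0 and 1.7.0.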
