roywei edited a comment on issue #15152: [CI][nightly] nightly test tutorial 
failure: test_tutorials.test_python_kvstore
URL: 
https://github.com/apache/incubator-mxnet/issues/15152#issuecomment-498938781
 
 
   This tutorial test was passing when running on a 1-GPU machine.
   
https://github.com/apache/incubator-mxnet/blob/master/docs/tutorials/python/kvstore.md
   ```python
   # The numbers used below assume 4 GPUs
   gpus = mx.context.num_gpus()
   if gpus > 0:
       contexts = [mx.gpu(i) for i in range(gpus)]
   else:
       contexts = [mx.cpu(i) for i in range(4)]
   ```
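   The branch taken above is easy to trace without a GPU build. The following is a hypothetical stand-in for the same selection logic (`pick_contexts` and the string contexts are invented for illustration; the real code uses `mx.context.num_gpus()` and `mx.gpu`/`mx.cpu`):

```python
# Hypothetical stand-in for the tutorial's context selection, with
# mx.context.num_gpus() replaced by an explicit argument and contexts
# represented as plain strings.
def pick_contexts(num_gpus):
    if num_gpus > 0:
        return ["gpu(%d)" % i for i in range(num_gpus)]
    return ["cpu(%d)" % i for i in range(4)]

print(pick_contexts(1))  # 1-GPU machine: ['gpu(0)']
print(pick_contexts(4))  # P3 instance: ['gpu(0)', 'gpu(1)', 'gpu(2)', 'gpu(3)']
print(pick_contexts(0))  # CPU-only machine: four cpu contexts
```

   So on the 1-GPU machine all data ends up on `gpu(0)`, while on the P3 instance the kvstore values are spread across four GPU contexts.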
   However, after I changed to P3 instances with 4 GPUs in https://github.com/apache/incubator-mxnet/pull/15141, it fails.
   ```
   MXNetError: [01:12:52] src/imperative/./imperative_utils.h:71: Check failed: inputs[i]->ctx().dev_mask() == ctx.dev_mask() (1 vs. 2) : Operator broadcast_add require all inputs live on the same context. But the first argument is on gpu(0) while the 2-th argument is on cpu(0)
   
   Stack trace:
   
     [bt] (0) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3c)
 [0x7f08e7052c1c]
   
     [bt] (1) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::GetContext(nnvm::NodeAttrs
 const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, 
std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, 
mxnet::Context const&)+0x823) [0x7f08e9fbf343]
   
     [bt] (2) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context
 const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, 
std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, 
std::allocator<mxnet::NDArray*> > const&)+0xdb) [0x7f08e9fcd47b]
   
     [bt] (3) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, 
int, void**, int*, void***, int, char const**, char const**)+0x1c9) 
[0x7f08eaab99d9]
   
     [bt] (4) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8f) 
[0x7f08eaab9edf]
   
     [bt] (5) 
/usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c)
 [0x7f093764ae20]
   
     [bt] (6) 
/usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb)
 [0x7f093764a88b]
   
     [bt] (7) 
/usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a)
 [0x7f093764501a]
   
     [bt] (8) 
/usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) 
[0x7f0937638fcb]
   
   ```
   
   The error comes from this part of the code, during the `broadcast_add` inside the updater: `stored` is `b` and lives on gpu(0), while `input` is `mx.nd.ones(shape)` and lives on cpu(0). It should not raise an error.
   ```python
    # `kv`, `a`, `b`, and `shape` are defined earlier in the tutorial;
    # `b` is the stored value, which lives on GPU.
    def update(key, input, stored):
        print("update on key: %d" % key)
        stored += input * 2

    kv._set_updater(update)
    kv.pull(3, out=a)
    print(a.asnumpy())

    # this push triggers the custom updater with a CPU-resident value
    kv.push(3, mx.nd.ones(shape))
    kv.pull(3, out=a)
    print(a.asnumpy())
   ```
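   The failing check itself is simple to model. Below is a rough Python sketch (an assumption about the logic, not MXNet's actual code) of the context check in `imperative_utils.h`: every input to an imperative operator must carry the same device mask, with cpu as dev_mask 1 and gpu as dev_mask 2, matching the `(1 vs. 2)` in the error above. `NDArrayStub` and `check_same_context` are invented names for illustration.

```python
# Rough sketch of the context check that produces the MXNetError above.
CPU_MASK, GPU_MASK = 1, 2  # dev_mask values seen in the error message

class NDArrayStub:
    """Minimal stand-in for an NDArray: just a context name and device mask."""
    def __init__(self, ctx_name, dev_mask):
        self.ctx_name = ctx_name
        self.dev_mask = dev_mask

def check_same_context(op_name, inputs):
    """Raise if any input's device mask differs from the first input's."""
    first = inputs[0]
    for i, arr in enumerate(inputs[1:], start=2):
        if arr.dev_mask != first.dev_mask:
            raise ValueError(
                "Operator %s require all inputs live on the same context. "
                "But the first argument is on %s while the %d-th argument is on %s"
                % (op_name, first.ctx_name, i, arr.ctx_name))

# `stored` (the kvstore value b) sits on gpu(0); the pushed
# mx.nd.ones(shape) is created on cpu(0), so the check fails.
stored = NDArrayStub("gpu(0)", GPU_MASK)
pushed = NDArrayStub("cpu(0)", CPU_MASK)
try:
    check_same_context("broadcast_add", [stored, pushed])
except ValueError as e:
    print(e)
```

   This matches the report above: the custom updater runs `stored += input * 2` with the two operands on different devices.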
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
