roywei edited a comment on issue #15152: [CI][nightly] nightly test tutorial failure: test_tutorials.test_python_kvstore
URL: https://github.com/apache/incubator-mxnet/issues/15152#issuecomment-498938781

This tutorial test was passing when run on a machine with 1 GPU.
https://github.com/apache/incubator-mxnet/blob/master/docs/tutorials/python/kvstore.md

```
# The numbers used below assume 4 GPUs
gpus = mx.context.num_gpus()
if gpus > 0:
    contexts = [mx.gpu(i) for i in range(gpus)]
else:
    contexts = [mx.cpu(i) for i in range(4)]
```

However, after I switched to P3 instances with 4 GPUs in https://github.com/apache/incubator-mxnet/pull/15141, it fails:

```
MXNetError: [01:12:52] src/imperative/./imperative_utils.h:71: Check failed: inputs[i]->ctx().dev_mask() == ctx.dev_mask() (1 vs. 2) : Operator broadcast_add require all inputs live on the same context. But the first argument is on gpu(0) while the 2-th argument is on cpu(0)
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3c) [0x7f08e7052c1c]
  [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::GetContext(nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::Context const&)+0x823) [0x7f08e9fbf343]
  [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0xdb) [0x7f08e9fcd47b]
  [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x1c9) [0x7f08eaab99d9]
  [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8f) [0x7f08eaab9edf]
  [bt] (5) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f093764ae20]
  [bt] (6) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f093764a88b]
  [bt] (7) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7f093764501a]
  [bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7f0937638fcb]
```

The error comes from the `broadcast_add` in this part of the code, where `stored` is `b` and lives on GPU, while `input` is `mx.nd.ones(shape)` and lives on CPU. It should not raise an error.

```
def update(key, input, stored):
    print("update on key: %d" % key)
    stored += input * 2
kv._set_updater(update)
kv.pull(3, out=a)
print(a.asnumpy())
kv.push(3, mx.nd.ones(shape))
#
kv.pull(3, out=a)
print(a.asnumpy())
```
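
As a possible stopgap for the tutorial test (not a fix for the underlying KVStore behavior), the updater could copy the pushed value to the device that holds the stored value before adding. This is only a sketch: it assumes the failure is limited to the context mismatch reported above, and it relies on the `NDArray.as_in_context` method and `NDArray.context` attribute.

```python
def update(key, input, stored):
    """Updater that tolerates a pushed value living on a different device."""
    print("update on key: %d" % key)
    # Assumption: copying `input` to the context of `stored` is acceptable here.
    # as_in_context returns the array itself when the contexts already match.
    stored += input.as_in_context(stored.context) * 2
```

It would be registered the same way as in the tutorial, with `kv._set_updater(update)`.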