samskalicky opened a new issue #14727: shape input names order mismatch after partitioning
URL: https://github.com/apache/incubator-mxnet/issues/14727

## Description
The input names of a symbol are produced by a DFS traversal of the symbol's graph from the outputs back up to the inputs. During graph partitioning, some nodes are moved into subgraphs, which can change the order of the DFS traversal. Shape propagation runs after graph partitioning, and the inferred shapes for the inputs are returned in the order in which the inputs appear in a DFS traversal of the partitioned graph. When partitioning has changed the traversal order, the inferred shapes may therefore be returned in a different order than expected. Because the original symbol is not modified, the caller expects the shapes in the same order as the original symbol's inputs. Since DFS order is not guaranteed to be identical before and after partitioning, we need to map names to shapes and ensure that the shapes are returned in the original order.

## Environment info (Required)
The error occurs on every release, and is reproducible on the master branch. I have built from source using the master branch and reproduced the problem.

## Error Message:
```
Traceback (most recent call last):
  File "run.py", line 139, in <module>
    mod.set_params(arg_params, aux_params, allow_missing=True)
  File "/home/ubuntu/mxnet/python/mxnet/module/module.py", line 358, in set_params
    self._exec_group.set_params(arg_params, aux_params, allow_extra=allow_extra)
  File "/home/ubuntu/mxnet/python/mxnet/module/executor_group.py", line 413, in set_params
    exec_.copy_params_from(arg_params, aux_params, allow_extra_params=allow_extra)
  File "/home/ubuntu/mxnet/python/mxnet/executor.py", line 361, in copy_params_from
    array.astype(dst.dtype).copyto(dst)
  File "/home/ubuntu/mxnet/python/mxnet/ndarray/ndarray.py", line 2089, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/home/ubuntu/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/ubuntu/mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:29:46] src/operator/random/./../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node at 0-th output: expected [1,1,128,128,60], got [15,1024,1,1]
Stack trace:
  [bt] (0) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fe5684779a2]
  [bt] (1) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseAttr<mxnet::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, mxnet::TShape const&)::{lambda(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*)#1}::operator()(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*) const+0x2202) [0x7fe56868d322]
  [bt] (2) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<1l, 1l>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x410) [0x7fe568692db0]
  [bt] (3) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0xe8a) [0x7fe56a74d87a]
  [bt] (4) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x368) [0x7fe56a753a28]
  [bt] (5) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0xb2a) [0x7fe56ae4fd6a]
  [bt] (6) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x534) [0x7fe56ae518f4]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe578ef5e40]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fe578ef58ab]
```

## Minimum reproducible example
This problem occurs on a few models; the one that I can share is the faster-rcnn model from the GluonCV package.
Here is how to get the model:
```
# get model
import gluoncv as cv

model = cv.model_zoo.faster_rcnn_resnet50_v1b_coco(pretrained=True)
im_fname = cv.utils.download('https://github.com/dmlc/web-data/blob/master/gluoncv/detection/biking.jpg?raw=true', path='biking.jpg')
x, orig_img = cv.data.transforms.presets.rcnn.load_test(im_fname)

# hybridize and run one forward pass so the model can be exported
model.hybridize()
box_ids, scores, bboxes = model(x)
model.export('faster-rcnn')
```

Once the model is exported, here is the code to reproduce the error using CPU context:
```
import mxnet as mx
import numpy as np
from collections import namedtuple
Batch = namedtuple('Batch', ['data'])
import os
from mxnet.base import _LIB, check_call, c_str, mx_uint, c_str_array

# operators that the 'default' subgraph backend is allowed to partition
op_names = [
    "_add", "_contrib_MultiBoxDetection", "_contrib_MultiBoxPrior", "_contrib_MultiBoxTarget",
    "_copy", "_div_scalar", "_DivScalar", "_minus", "_Minus", "_minus_scalar", "_MinusScalar",
    "_mul", "_Mul", "_mul_scalar", "_MulScalar", "_plus", "_Plus", "_plus_scalar", "_PlusScalar",
    "_rdiv_scalar", "_RDivScalar", "_rminus_scalar", "_RMinusScalar", "_rnn_param_concat", "_sub",
    "abs", "Activation", "arccos", "arccosh", "arcsin", "arcsinh", "arctan", "arctanh",
    "argmax", "argmin", "BatchNorm", "BatchNorm_v1", "BlockGrad", "broadcast_add",
    "broadcast_equal", "broadcast_greater", "broadcast_greater_equal", "broadcast_lesser",
    "broadcast_lesser_equal", "broadcast_mul", "broadcast_not_equal", "broadcast_plus",
    "cast", "Cast", "clip", "concat", "Concat", "Convolution", "Convolution_v1", "cos", "cosh",
    "crop", "Deconvolution", "Dropout", "elemwise_add", "elemwise_mul", "elemwise_sub",
    "Embedding", "exp", "expand_dims", "flatten", "Flatten", "flip", "FullyConnected",
    "identity", "identity", "LeakyReLU", "LinearRegressionOutput", "log", "log_softmax", "LRN",
    "make_loss", "MakeLoss", "max", "max_axis", "mean", "min", "min_axis", "negative",
    "one_hot", "pad", "Pad", "pick", "Pooling", "Pooling_v1", "prod", "reciprocal", "relu",
    "repeat", "reshape", "Reshape", "reverse", "RNN", "rsqrt", "sigmoid", "sin", "sinh",
    "slice", "SliceChannel", "softmax", "SoftmaxActivation", "SoftmaxOutput", "softmin",
    "split", "sqrt", "sum", "sum_axis", "tan", "tanh", "tile", "topk", "transpose", "zeros_like"
]

# register the operator list and enable graph partitioning with the 'default' backend
check_call(_LIB.MXSetSubgraphPropertyOpNames(c_str("default"), mx_uint(len(op_names)), c_str_array(op_names)))
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

ctx = mx.cpu()
sym, arg_params, aux_params = mx.model.load_checkpoint('faster-rcnn', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))], label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)  # the error is raised here (see traceback)

fname = mx.test_utils.download('https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true')
img = mx.image.imread(fname)
# convert into format (batch, RGB, height, width)
img = mx.image.imresize(img, 224, 224)  # resize
img = img.transpose((2, 0, 1))  # Channel first
img = img.expand_dims(axis=0)  # batchify
mod.forward(Batch([img]))
print(mod.get_outputs())
```

## What have you tried to solve it?
I've tested a fix in a private branch: https://github.com/samskalicky/incubator-mxnet/commit/517d29498059d081873d1bd160d95479a5c8cea9
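The linked commit is not reproduced here. As a rough illustration of the idea from the Description (map input names to the shapes inferred on the partitioned graph, then read them back in the original symbol's input order), a minimal Python sketch might look like the following. The function name `reorder_shapes`, the input names, and the shapes are all hypothetical and chosen only for illustration. The actual fix is in the MXNet backend and may be structured differently.

```python
def reorder_shapes(original_names, partitioned_names, partitioned_shapes):
    """Return the inferred shapes in the original symbol's input order.

    Hypothetical helper for illustration only, not an MXNet API.
    original_names:     input names from a DFS of the original symbol
    partitioned_names:  input names from a DFS of the partitioned symbol
    partitioned_shapes: inferred shapes, aligned with partitioned_names
    """
    if len(partitioned_names) != len(partitioned_shapes):
        raise ValueError("names and shapes must have the same length")

    # Build the name-to-shape map from the partitioned graph's ordering.
    shape_by_name = dict(zip(partitioned_names, partitioned_shapes))

    missing = [name for name in original_names if name not in shape_by_name]
    if missing:
        raise KeyError("no inferred shape for inputs: %s" % missing)

    # Read the shapes back in the order the caller expects.
    return [shape_by_name[name] for name in original_names]


# Toy example: partitioning swapped the DFS order of the two weights.
original_names = ["data", "conv_weight", "fc_weight"]
partitioned_names = ["data", "fc_weight", "conv_weight"]
partitioned_shapes = [(1, 3, 224, 224), (10, 64), (64, 3, 7, 7)]

fixed_shapes = reorder_shapes(original_names, partitioned_names, partitioned_shapes)
print(fixed_shapes)
```

Matching by name rather than by position relies on input names being unique and unchanged by partitioning, which is an assumption of this sketch.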
