hzfan opened a new issue #15931: TBlob bug about dltensor
URL: https://github.com/apache/incubator-mxnet/issues/15931

## Description

TBlob does not disable or overload the default copy constructor and assignment operator, so the compiler-generated ones are used. This results in a shallow copy of `dltensor_` (a field of type `DLTensor` in TBlob, see [here](https://github.com/apache/incubator-mxnet/blob/5a4c01bac9afd4e75227a0b4b1231bceffb204df/include/mxnet/tensor_blob.h#L415)) and a dangling pointer (use-after-free).

## Environment info (Required)

Python 3.7.3
Built from source (master at 5a4c01bac9afd4e75227a0b4b1231bceffb204df)

## Minimum reproducible example

To reproduce this error, I made a minor change to the function [NumpyDotForward](https://github.com/apache/incubator-mxnet/blob/5a4c01bac9afd4e75227a0b4b1231bceffb204df/src/operator/numpy/np_dot-inl.h#L39) (in src/operator/numpy/np_dot-inl.h) for illustration. Here is the function after my modification; I changed one line and added two lines (marked by comments):

```cpp
template<typename xpu>
inline void NumpyDotForward(const nnvm::NodeAttrs& attrs,
                            const OpContext& ctx,
                            const std::vector<TBlob>& inputs,
                            const std::vector<OpReqType>& req,
                            const std::vector<TBlob>& outputs) {
  using namespace mshadow;
  using namespace mxnet_op;

  CHECK_EQ(inputs.size(), 2U);
  CHECK_EQ(outputs.size(), 1U);

  const TBlob& a = inputs[0];
  const TBlob& b = inputs[1];
  // const TBlob& out = outputs[0];
  TBlob out = outputs[0];           // changed by me
  const mxnet::TShape a_shape = a.shape_;
  const mxnet::TShape b_shape = b.shape_;
  out = out.reshape(out.shape_);    // added by me
  out = TBlob(out.dltensor());      // added by me
  MSHADOW_REAL_TYPE_SWITCH(out.type_flag_, DType, {
    if (b_shape.ndim() < 3) {
      // Case 1, 2, 3, 4, 5: a is N-D array (N >= 1) and b is vector or matrix, sum product
      // over the last axis of a and the first axis of b
      TensordotIntAxesImpl<xpu>(1, ctx, a, b, out, req[0]);
    } else {
      // Case 3, 5.5: a is N-D array and b is M-D array (M > 2), sum product over the last axis
      // of a and the 2nd-to-last axis of b
      const Tuple<int> a_axes_summed({a_shape.ndim() - 1});
      const Tuple<int> b_axes_summed({b_shape.ndim() - 2});
      TensordotImpl<xpu>(a_axes_summed, b_axes_summed, ctx, a, b, out, req);
    }
  });
}
```

## Steps to reproduce

1. Replace `NumpyDotForward` with the version above.
2. Build.
3. Run the following:

```python
from mxnet import np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1], [1, 1]])
np.dot(a, b)
```

The expected result is

```
array([[ 6.,  6.],
       [15., 15.]])
```

but the actual result is

```
array([[0., 0.],
       [0., 0.]])
```

## The cause of this problem

`TBlob.dltensor_.shape` is a pointer. When TBlob `b` is assigned to TBlob `a`, the pointer gets shallow copied:

```
a.dltensor_.shape = b.dltensor_.shape
```

But `b.dltensor_.shape` points to `b.shape_.data()`. So when `b` is a temporary (like the return value of `TBlob::reshape()`), `b.shape_.data()` is destroyed once the temporary dies, and `a.dltensor_.shape` is left pointing to invalid memory. (A standalone sketch of this pattern appears at the end of this issue.)

## Possible solutions (IMO)

- Disable the default assignment operator and copy constructor (declare them `private`, or `= delete` them).
- Overload them and use `SetDLTensor` to re-point `dltensor_.shape` at the new object's own storage, avoiding the shallow copy (see the sketch at the end of this issue).

Thanks to @yzhliu @reminisce @haojin2 for the help.
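To make the failure mode concrete outside MXNet, here is a minimal standalone sketch of the same pattern. `MiniBlob` and `MiniDLTensor` are hypothetical stand-ins for TBlob and DLTensor, and the inline `shape_` array mimics the small-shape inline storage of `TShape`; this is illustrative only, not MXNet code:

```cpp
#include <cstdint>
#include <iostream>

// Stand-in for DLTensor: holds a raw, non-owning pointer to the shape.
struct MiniDLTensor {
  int64_t* shape;
};

// Stand-in for TBlob: owns its shape storage inline and keeps a
// DLTensor-like view that points into that storage.
struct MiniBlob {
  int64_t shape_[2];         // owning storage inside the object itself
  MiniDLTensor dltensor_{};

  MiniBlob(int64_t d0, int64_t d1) : shape_{d0, d1} {
    dltensor_.shape = shape_;  // view points into this object's own storage
  }
  // No user-defined copy constructor / operator=: the compiler-generated
  // ones copy dltensor_.shape verbatim -- the shallow copy in question.
};

int main() {
  MiniBlob b(2, 3);
  MiniBlob a = b;  // default copy: a.dltensor_.shape still points at b.shape_
  std::cout << (a.dltensor_.shape == a.shape_) << std::endl;  // prints 0
}
```

After the default copy, `a.dltensor_.shape` still targets `b`'s storage; when `b` is a dying temporary (as with the return value of `reshape()` in the modified `NumpyDotForward` above), that pointer dangles.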
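And here is a hedged sketch of the second proposed fix, again on the hypothetical `MiniBlob` layout (reusing `MiniDLTensor` from the sketch above): user-defined copy operations that re-point `dltensor_.shape` at the destination's own storage, in the spirit of what `SetDLTensor` does for the real TBlob. The real TBlob has more fields (`dptr_`, `type_flag_`, device info), so this only illustrates the shape fix-up:

```cpp
struct FixedBlob {
  int64_t shape_[2];
  MiniDLTensor dltensor_{};

  FixedBlob(int64_t d0, int64_t d1) : shape_{d0, d1} { SetDLTensor(); }

  // User-defined copy operations: copy the owned shape data, then repair
  // the internal pointer instead of shallow-copying it.
  FixedBlob(const FixedBlob& other)
      : shape_{other.shape_[0], other.shape_[1]} {
    SetDLTensor();
  }
  FixedBlob& operator=(const FixedBlob& other) {
    shape_[0] = other.shape_[0];
    shape_[1] = other.shape_[1];
    SetDLTensor();
    return *this;
  }

 private:
  void SetDLTensor() { dltensor_.shape = shape_; }  // mimics TBlob::SetDLTensor
};
```

With this, copying from a temporary leaves the destination's `dltensor_.shape` pointing at its own `shape_`. Deleting the copy operations (the first bullet) avoids the bug too, but would break every call site that copies TBlobs by value, so overloading is likely the less invasive route.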
