hzfan opened a new issue #15931: TBlob bug about dltensor
URL: https://github.com/apache/incubator-mxnet/issues/15931
 
 
## Description
TBlob neither deletes nor overloads the default copy constructor and copy assignment operator, so the compiler-generated ones are used. They perform a shallow copy of dltensor_ (a field of type DLTensor in TBlob, see [here](https://github.com/apache/incubator-mxnet/blob/5a4c01bac9afd4e75227a0b4b1231bceffb204df/include/mxnet/tensor_blob.h#L415)), which can leave the copy holding a dangling shape pointer into freed memory.
   
## Environment info (Required)
Python 3.7.3
Built from source (master at 5a4c01bac9afd4e75227a0b4b1231bceffb204df)
   
## Minimum reproducible example
To reproduce this error, I made a minor change to the function [NumpyDotForward](https://github.com/apache/incubator-mxnet/blob/5a4c01bac9afd4e75227a0b4b1231bceffb204df/src/operator/numpy/np_dot-inl.h#L39) (in src/operator/numpy/np_dot-inl.h) for illustration.

Here is the function after my modification. I modified one line and added two lines (marked by comments):
   
```c++
template<typename xpu>
inline void NumpyDotForward(const nnvm::NodeAttrs& attrs,
                            const OpContext& ctx,
                            const std::vector<TBlob>& inputs,
                            const std::vector<OpReqType>& req,
                            const std::vector<TBlob>& outputs) {
  using namespace mshadow;
  using namespace mxnet_op;

  CHECK_EQ(inputs.size(), 2U);
  CHECK_EQ(outputs.size(), 1U);

  const TBlob& a = inputs[0];
  const TBlob& b = inputs[1];
  // const TBlob& out = outputs[0];
  TBlob out = outputs[0];  // changed by me
  const mxnet::TShape a_shape = a.shape_;
  const mxnet::TShape b_shape = b.shape_;
  out = out.reshape(out.shape_);  // added by me
  out = TBlob(out.dltensor());  // added by me
  MSHADOW_REAL_TYPE_SWITCH(out.type_flag_, DType, {
    if (b_shape.ndim() < 3) {
      // Case 1, 2, 3, 4, 5: a is N-D array (N >= 1) and b is vector or matrix,
      //     sum product over the last axis of a and the first axis of b
      TensordotIntAxesImpl<xpu>(1, ctx, a, b, out, req[0]);
    } else {
      // Case 3, 5.5: a is N-D array and b is M-D array (M > 2), sum product
      //     over the last axis of a and the 2nd-to-last axis of b
      const Tuple<int> a_axes_summed({a_shape.ndim() - 1});
      const Tuple<int> b_axes_summed({b_shape.ndim() - 2});
      TensordotImpl<xpu>(a_axes_summed, b_axes_summed, ctx, a, b, out, req);
    }
  });
}
```
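
My reading of why the two added lines break `out` (details in the cause section below), written as comments on those lines:

```c++
// reshape() returns a temporary TBlob whose dltensor_.shape points into the
// temporary's own shape storage; the default copy assignment copies that
// pointer verbatim, and the temporary is destroyed at the end of this
// statement, so out.dltensor_.shape is left dangling.
out = out.reshape(out.shape_);  // added by me

// Constructing a TBlob from the now-dangling DLTensor reads the invalid
// shape pointer, so out ends up with a garbage shape.
out = TBlob(out.dltensor());  // added by me
```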
   
## Steps to reproduce

1. Replace NumpyDotForward with the version above
2. Build
3. Run the following:
```python
from mxnet import np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1], [1, 1]])
np.dot(a, b)
```
The expected result is
```
array([[ 6.,  6.],
       [15., 15.]])
```
But the actual result is
```
array([[0., 0.],
       [0., 0.]])
```
   
## The cause of this problem
TBlob.dltensor_.shape is a pointer. When a TBlob b is assigned to a TBlob a, the default assignment shallow-copies that pointer:
```
a.dltensor_.shape = b.dltensor_.shape
```
But b.dltensor_.shape points to b.shape_.data(). So when b is a temporary (like the return value of TBlob.reshape()), b.shape_.data() is destroyed together with b at the end of the statement, and a.dltensor_.shape is left pointing to invalid memory.
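
To make the failure mode concrete, here is a minimal, self-contained sketch with hypothetical `FakeTBlob`/`FakeDLTensor` types standing in for the real classes (names and members are simplified for illustration; TShape's inline small-ndim storage is modeled with a fixed-size array):

```c++
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Hypothetical stand-ins for DLTensor and TBlob, simplified for illustration.
struct FakeDLTensor {
  const int64_t* shape;  // non-owning pointer into some blob's shape storage
  int ndim;
};

struct FakeTBlob {
  int64_t shape_[4];  // inline shape storage (stands in for TShape)
  int ndim_;
  FakeDLTensor dltensor_;

  FakeTBlob(std::initializer_list<int64_t> s) : ndim_(0) {
    for (int64_t d : s) shape_[ndim_++] = d;
    SetDLTensor();
  }
  void SetDLTensor() {         // analogous to TBlob's SetDLTensor helper
    dltensor_.shape = shape_;  // points into *this* object's own storage
    dltensor_.ndim = ndim_;
  }
  // No user-defined copy operations: the compiler-generated ones copy
  // dltensor_.shape verbatim, so a copy keeps pointing into the source.

  FakeTBlob reshape(std::initializer_list<int64_t> s) const {
    return FakeTBlob(s);  // returned by value, like TBlob::reshape
  }
};

int main() {
  FakeTBlob out({2, 3});
  out = out.reshape({3, 2});  // default assignment copies the temporary's pointer
  // The temporary returned by reshape() is destroyed at the end of the
  // statement above, so out.dltensor_.shape now dangles.
  std::printf("own storage:     %p\n", static_cast<const void*>(out.shape_));
  std::printf("dltensor_.shape: %p\n", static_cast<const void*>(out.dltensor_.shape));
  // The two addresses differ: reading through dltensor_.shape is undefined behavior.
  return 0;
}
```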
   
## Possible solutions (IMO)
- disable the default copy constructor/assignment (declare them private, or `= delete` them)
- overload them and call SetDLTensor to avoid the shallow copy (see the sketch after this list)
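
For the second option, here is a sketch on the simplified `FakeTBlob` model from above (the real fix would do the same inside TBlob itself, reusing its existing SetDLTensor helper; this is an illustration, not the actual patch):

```c++
#include <cstdint>

struct FakeDLTensor {
  const int64_t* shape;
  int ndim;
};

struct FakeTBlob {
  int64_t shape_[4];
  int ndim_;
  FakeDLTensor dltensor_;

  // Option 1 would simply be:
  //   FakeTBlob(const FakeTBlob&) = delete;
  //   FakeTBlob& operator=(const FakeTBlob&) = delete;

  // Option 2: user-defined copy operations that re-point the DLTensor at the
  // destination's own storage instead of shallow-copying the pointer.
  FakeTBlob(const FakeTBlob& other) : ndim_(other.ndim_) {
    for (int i = 0; i < ndim_; ++i) shape_[i] = other.shape_[i];
    SetDLTensor();  // point at our own shape_, not other's
  }
  FakeTBlob& operator=(const FakeTBlob& other) {
    ndim_ = other.ndim_;
    for (int i = 0; i < ndim_; ++i) shape_[i] = other.shape_[i];
    SetDLTensor();
    return *this;
  }

  void SetDLTensor() {
    dltensor_.shape = shape_;
    dltensor_.ndim = ndim_;
  }
};
```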
   
Thanks to @yzhliu @reminisce @haojin2 for the help.
