DickJC123 opened a new issue #19353:
URL: https://github.com/apache/incubator-mxnet/issues/19353


   ## Description
   This problem showed up as infrequent failures of test_operator_gpu.py::test_laop on a customized MXNet.  The failures occurred when the test exercised the _potri_ operator (inverse via Cholesky factorization), and the error was traced back to a missing synchronization in that operator's use of temporary buffers.  The problem appears to be present in the current MXNet master and affects numerous linalg operations.
   
   The _potri_ operator is registered with an 'FInplaceOption' and so can expect its input and output buffers to be the same tensor.  However, the GPU implementation lacks an in-place kernel, so a temporary buffer is allocated: https://github.com/apache/incubator-mxnet/blob/master/src/operator/linalg_impl.h#L876-L899.  A summary of this code is:
   ```
   <allocate a temp buffer B the same size as i/o A>
   <write the identity-matrix pattern to buffer B>
   <call trsm(A) twice to write further portions of B>
   <copy computed result B back to i/o A>
   <free buffer B>
   ```
   At this point in the CPU's processing, buffer B has been returned to the free pool, yet the kernels operating on B may still be enqueued on a GPU stream!  The memory of buffer B is now eligible to be reassigned to a GPU or COPY worker with a different stream.  GPU work enqueued by those workers could then overwrite buffer B before the _potri_ use of B is complete.
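
   To make the failure mode concrete, here is a minimal CUDA sketch (not MXNet code; the pool, kernel and function names are hypothetical) of the same pattern: a pooled temporary is handed back while work on it is still queued on one stream, and a second stream immediately reuses the pointer.
   ```
   #include <cuda_runtime.h>
   #include <cstddef>
   #include <vector>

   // Stand-in for a pooled device allocator that recycles pointers without
   // waiting for outstanding GPU work (the hazard described above).
   struct CachingPool {
     std::vector<void*> free_list;
     void* Alloc(std::size_t bytes) {
       if (!free_list.empty()) {
         void* p = free_list.back();
         free_list.pop_back();
         return p;
       }
       void* p = nullptr;
       cudaMalloc(&p, bytes);
       return p;
     }
     void Free(void* p) { free_list.push_back(p); }  // no stream sync before reuse
   };

   __global__ void fill(float* buf, float value, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) buf[i] = value;
   }

   // Mimics the _potri_ pattern: temp buffer B is written on stream s1 and copied
   // to A, then freed while that work may still be pending.  A worker on stream
   // s2 can then be handed the same pointer and clobber B mid-flight.
   void potri_like(CachingPool* pool, cudaStream_t s1, cudaStream_t s2,
                   float* A, int n) {
     float* B = static_cast<float*>(pool->Alloc(n * sizeof(float)));
     fill<<<(n + 255) / 256, 256, 0, s1>>>(B, 1.0f, n);                 // async work on B
     cudaMemcpyAsync(A, B, n * sizeof(float), cudaMemcpyDeviceToDevice, s1);
     pool->Free(B);                                                     // B recycled too early
     float* C = static_cast<float*>(pool->Alloc(n * sizeof(float)));    // same pointer as B
     fill<<<(n + 255) / 256, 256, 0, s2>>>(C, 0.0f, n);                 // can overwrite B's data
   }
   ```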
   
   I solicit comment on some options that I consider feasible:
    1. Insert a cudaStreamSynchronize() before the buffer free (a minimal sketch follows this list).  This is certainly the simplest fix, but it would stall the CPU thread of the GPU worker and thereby 'expose' the launch overhead of the next GPU operation.
    2. Add a TempSpace ResourceRequest to the _potri_ operator and allocate the buffer from the temp space (see the sketch after this list).  However, this would tie the scheduling of this operator to the other tempspace-using operators in the graph.  There are two _potri_ function signatures in our linalg API: one with an OpContext parameter and a lower-level one with a stream parameter (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/la_op-inl.h#L146-L164).  This fix would correct the call with the OpContext parameter.  A trial compile shows that the _potri_ version that takes the stream arg is not used elsewhere and can be marked private.
    3. Remove the _potri_ FInplaceOption, at least when MXNet is built with USE_CUDA=1, and create a 'separate i/o' version of the implementation that needs no extra buffer.  This would eliminate an unnecessary Tensor copy in the GPU implementation, but would add a copy in the CPU implementation (though only on GPU builds).
    4. The above suggestion 3 is based on the assumption that CPU and GPU operator implementations must share a single FInplaceOption spec.  This limitation could possibly be lifted, or two separate operators could be defined, e.g. _potri_cpu_ and _potri_gpu_.
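
   A minimal sketch of option 1, assuming the temporary is released via Storage::Get()->Free() as in linalg_impl.h and that 's' is the operator's mshadow GPU stream (the variable names are illustrative):
   ```
   // Option 1 (sketch): drain the stream that the _potri_ kernels and the
   // B -> A copy were enqueued on before the temporary buffer is freed.
   cudaStreamSynchronize(mshadow::Stream<gpu>::GetStream(s));  // wait for all work on B
   Storage::Get()->Free(buffer);                               // now safe to recycle B
   ```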
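   And a minimal sketch of option 2 using the usual FResourceRequest / kTempSpace machinery; the registration name _linalg_potri and the surrounding wiring are assumptions, not a tested patch:
   ```
   // Option 2 (sketch): declare the temp-space resource on the operator ...
   NNVM_REGISTER_OP(_linalg_potri)
   .set_attr<FResourceRequest>("FResourceRequest",
     [](const nnvm::NodeAttrs& attrs) {
       return std::vector<ResourceRequest>{ResourceRequest::kTempSpace};
     });

   // ... and allocate the temporary from it in the OpContext-based GPU path,
   // instead of calling Storage::Get()->Alloc()/Free() directly.
   mshadow::Tensor<gpu, 1, DType> B =
       ctx.requested[0].get_space_typed<gpu, 1, DType>(mshadow::Shape1(A.MSize()), s);
   ```
   The temp-space buffer is then managed by the engine along with the operator's other resources, so no explicit free of the buffer is needed in the operator code.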
   
   Note that the current _potri_ implementation, besides the missing synchronization, is also not CUDA Graphs compatible, since any captured buffer pointer would not be valid when the captured graph is replayed.
   
   The _potrf_, _getri_ and _getrf_ operators also have this problem, although there the allocated buffers are not due to a 'missing i/o', so the TempSpace solution (option 2) seems most appropriate.  In addition, many operators (_potrf_, _gelqf_, _orglq_, _syevd_, _gesvd_, _getrf_, _getri_) allocate and then free an int-sized 'return info' buffer whose contents are never inspected.  Because of the small size and because the contents are never read, this could presumably be handled by a static, possibly shared, 'dummy allocation' (sketched below).
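
   A minimal sketch of that shared 'dummy allocation' idea (the function name is hypothetical and nothing here is existing MXNet code).  Concurrent writes to the same int from different streams would be benign races, since the contents are never read:
   ```
   #include <cuda_runtime.h>

   // A single device int, allocated once and shared by every solver call whose
   // 'info' output is ignored.  Intentionally never freed.
   int* GetDummyDeviceInfo() {
     static int* info = [] {
       int* p = nullptr;
       cudaMalloc(&p, sizeof(int));
       return p;
     }();
     return info;
   }

   // Usage, e.g. as the devInfo argument of a cuSOLVER call:
   //   cusolverDnSpotrf(handle, uplo, n, A, lda, workspace, lwork, GetDummyDeviceInfo());
   ```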
   
   @ptrendx @andrei5055 @szha @leezu 

