DickJC123 opened a new issue #19353: URL: https://github.com/apache/incubator-mxnet/issues/19353
## Description

This problem showed up as infrequent failures of test_operator_gpu.py::test_laop on a customized MXNet. The failures occurred when the test exercised the _potri_ operator (inverse via Cholesky factorization), and the error was traced back to a missing synchronization in that operator's use of temporary buffers. The problem appears to be present in the current MXNet master and affects numerous linalg operations.

The _potri_ operator is registered with an 'FInplaceOption' and so can expect to see its input and output buffers being the same tensor. However, the GPU implementation lacks an in-place kernel, so a temporary buffer is allocated: https://github.com/apache/incubator-mxnet/blob/master/src/operator/linalg_impl.h#L876-L899. A summary of this code is:
```
<allocate a temp buffer B the same size as i/o A>
<write the identity-matrix pattern to buffer B>
<call trsm(A) twice to write further portions of B>
<copy computed result B back to i/o A>
<free buffer B>
```
At this point in the CPU's processing, buffer B has been returned to the free pool, yet the kernels operating on B may still be enqueued on a GPU stream! The allocated memory of buffer B is now eligible to be reassigned to a GPU or COPY worker having a different stream. GPU work enqueued by those workers could then overwrite buffer B before the _potri_ use of B is complete.

I solicit comment on some options that I consider feasible:

1. Insert a cudaStreamSynchronize() before the buffer free. This is certainly the simplest fix, but it would hang up the cpu-thread of the GPU worker and thereby 'expose' the launch overhead of the next GPU operation.
2. Add a ResourceRequest for TempSpace to the _potri_ operator, and allocate the buffer from the temp space (a rough sketch follows this list). This would, however, tie the scheduling of this operator to other tempspace-using operators in the graph. There are two _potri_ function signatures in our linalg API: one with an OpContext parameter and a lower-level one with a stream parameter (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/la_op-inl.h#L146-L164). This fix would correct the call with the OpContext parameter. A trial compile shows that the _potri_ version taking the stream arg is not used elsewhere and could be marked private.
3. Remove the _potri_ FInplaceOption, at least when MXNet is built with USE_CUDA=1, and create a 'separate i/o' version of the implementation that needs no extra buffer. This would eliminate an unnecessary Tensor copy in the GPU implementation, but would add a copy for the CPU implementation (though only on GPU builds).
4. Suggestion 3 above assumes that CPU and GPU operator implementations must share a single FInplaceOption spec. That limitation could possibly be lifted, or two operators could be defined, e.g. _potri_cpu_ and _potri_gpu_.
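To make the first two options concrete: option 1 would amount to something like `CUDA_CALL(cudaStreamSynchronize(mshadow::Stream<gpu>::GetStream(s)));` right before buffer B is freed back to the pool. For option 2, the sketch below shows the general shape I have in mind. It is illustrative only, not a tested patch: it assumes the usual `FResourceRequest` / `ResourceRequest::kTempSpace` registration pattern on the `_linalg_potri` op inside the mxnet::op namespace, and `potri_gpu_sketch` is a hypothetical stand-in for the real code in linalg_impl.h.

```cpp
// Registration side (la_op.cc style): request engine-managed temp space for
// _linalg_potri so the scratch matrix no longer comes from a raw alloc/free.
NNVM_REGISTER_OP(_linalg_potri)
.set_attr<FResourceRequest>("FResourceRequest",
  [](const nnvm::NodeAttrs& attrs) {
    return std::vector<ResourceRequest>{ResourceRequest::kTempSpace};
  });

// Kernel side (linalg_impl.h style): the OpContext-based potri overload draws
// its scratch buffer B from the requested temp space.  The engine ties the
// lifetime of that space to the operator's scheduling, so no explicit stream
// synchronization is needed before another operator reuses the memory.
template <typename DType>
void potri_gpu_sketch(const mshadow::Tensor<mshadow::gpu, 3, DType>& A,
                      const OpContext& ctx) {
  mshadow::Stream<mshadow::gpu>* s = ctx.get_stream<mshadow::gpu>();
  mshadow::Tensor<mshadow::gpu, 1, DType> B =
      ctx.requested[0].get_space_typed<mshadow::gpu, 1, DType>(
          mshadow::Shape1(A.MSize()), s);
  // <write identity pattern into B, run the two trsm calls, copy B back to A>
}
```

With something along these lines in place, the stream-parameter overload of _potri_ could be made private, as noted above.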
Note that the current _potri_ implementation, besides missing synchronization, is also not CUDA Graphs compatible, since any captured buffer pointer would not be valid when the captured graph is replayed. The _potrf_, _getri_ and _getrf_ operators also have this problem, although there the alloc'd buffers are not due to a 'missing i/o', so the TempSpace solution 2 seems most appropriate for them.

In addition, many operators (_potrf_, _gelqf_, _orglq_, _syevd_, _gesvd_, _getrf_, _getri_) allocate and then free an int-sized 'return info' buffer that is never looked at. Because that buffer is tiny and its contents are never inspected, it could presumably be handled by a static, possibly shared, 'dummy allocation' (a minimal sketch is appended at the end of this issue).

@ptrendx @andrei5055 @szha @leezu
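For reference, here is a minimal standalone sketch of that shared dummy 'info' buffer idea. The helper name `SharedDummyInfo` is hypothetical, not existing MXNet code:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (name and placement illustrative only): one device int,
// allocated once and intentionally never freed, shared by every call site
// that needs somewhere for a cuSOLVER/cuBLAS 'info' output it never reads
// back.  Concurrent writes from different streams are harmless because the
// value is never inspected, and the pointer stays valid across CUDA Graph
// capture/replay since it is never reallocated.
static int* SharedDummyInfo() {
  static int* ptr = []() {
    int* p = nullptr;
    cudaMalloc(&p, sizeof(int));  // leaked deliberately for process lifetime
    return p;
  }();
  return ptr;
}
```

Call sites such as the cusolverDn*potrf invocations could then pass `SharedDummyInfo()` as the devInfo argument instead of allocating and immediately freeing a one-int buffer per call.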
