DickJC123 opened a new issue #19353: URL: https://github.com/apache/incubator-mxnet/issues/19353
## Description

This problem showed up as infrequent failures of test_operator_gpu.py::test_laop on a customized MXNet. The failures occurred when the test exercised the _potri_ operator (inverse via Cholesky factorization), and the error was traced back to a missing synchronization in that operator's use of temporary buffers. The problem appears to be present in the current MXNet master and affects numerous linalg operations.

The _potri_ operator is registered with an 'FInplaceOption' and so can expect to see its input and output buffers being the same tensor. However, the GPU implementation lacks an in-place kernel, so a temporary buffer is allocated: https://github.com/apache/incubator-mxnet/blob/master/src/operator/linalg_impl.h#L876-L899. A summary of this code is:
```
<allocate a temp buffer B the same size as i/o A>
<write the identity-matrix pattern to buffer B>
<call trsm(A) twice to write further portions of B>
<copy computed result B back to i/o A>
<free buffer B>
```
At this point in the CPU's processing, buffer B has been returned to the free pool, yet the kernels operating on B may still be enqueued on a GPU stream! The allocated memory of buffer B is now eligible to be reassigned to a GPU or COPY worker having a different stream. GPU work enqueued by those workers could then overwrite buffer B before the _potri_ use of B is complete.

I solicit comment on some options that I consider feasible:

1. Insert a cudaStreamSynchronize() before the buffer free. This is certainly the simplest fix, but it would hang up the cpu-thread of the GPU worker and thereby 'expose' the launch overhead of the next GPU operation.
2. Add a ResourceRequest for TempSpace to the _potri_ operator, and allocate the buffer from the temp space (a rough sketch follows this list). This would, however, tie the scheduling of this operator to other tempspace-using operators in the graph. There are two _potri_ function signatures in our linalg API: one with an OpContext parameter and a lower-level one with a stream parameter (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/la_op-inl.h#L146-L164). This fix would correct the call with the OpContext parameter. A trial compile shows that the _potri_ version taking the stream arg is not used elsewhere and could be marked private.
3. Remove the _potri_ FInplaceOption, at least when MXNet is built with USE_CUDA=1, and create a 'separate i/o' version of the implementation that needs no extra buffer. This would eliminate an unnecessary Tensor copy in the GPU implementation, but would add a copy for the CPU implementation (though only on GPU builds).
4. Suggestion 3 above assumes that CPU and GPU operator implementations must share a single FInplaceOption spec. That limitation could possibly be lifted, or two operators could be defined, e.g. _potri_cpu_ and _potri_gpu_.
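To make the first two options concrete: option 1 would amount to something like `CUDA_CALL(cudaStreamSynchronize(mshadow::Stream<gpu>::GetStream(s)));` right before buffer B is freed back to the pool. For option 2, the sketch below shows the general shape I have in mind. It is illustrative only, not a tested patch: it assumes the usual `FResourceRequest` / `ResourceRequest::kTempSpace` registration pattern on the `_linalg_potri` op inside the mxnet::op namespace, and `potri_gpu_sketch` is a hypothetical stand-in for the real code in linalg_impl.h.

```cpp
// Registration side (la_op.cc style): request engine-managed temp space for
// _linalg_potri so the scratch matrix no longer comes from a raw alloc/free.
NNVM_REGISTER_OP(_linalg_potri)
.set_attr<FResourceRequest>("FResourceRequest",
  [](const nnvm::NodeAttrs& attrs) {
    return std::vector<ResourceRequest>{ResourceRequest::kTempSpace};
  });

// Kernel side (linalg_impl.h style): the OpContext-based potri overload draws
// its scratch buffer B from the requested temp space.  The engine ties the
// lifetime of that space to the operator's scheduling, so no explicit stream
// synchronization is needed before another operator reuses the memory.
template <typename DType>
void potri_gpu_sketch(const mshadow::Tensor<mshadow::gpu, 3, DType>& A,
                      const OpContext& ctx) {
  mshadow::Stream<mshadow::gpu>* s = ctx.get_stream<mshadow::gpu>();
  mshadow::Tensor<mshadow::gpu, 1, DType> B =
      ctx.requested[0].get_space_typed<mshadow::gpu, 1, DType>(
          mshadow::Shape1(A.MSize()), s);
  // <write identity pattern into B, run the two trsm calls, copy B back to A>
}
```

With something along these lines in place, the stream-parameter overload of _potri_ could be made private, as noted above.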
Note that the current _potri_ implementation, besides missing synchronization, is also not CUDA Graphs compatible, since any captured buffer pointer would not be valid when the captured graph is replayed. The _potrf_, _getri_ and _getrf_ operators also have this problem, although there the alloc'd buffers are not due to a 'missing i/o', so the TempSpace solution 2 seems most appropriate for them.

In addition, many operators (_potrf_, _gelqf_, _orglq_, _syevd_, _gesvd_, _getrf_, _getri_) allocate and then free an int-sized 'return info' buffer that is never looked at. Because that buffer is tiny and its contents are never inspected, it could presumably be handled by a static, possibly shared, 'dummy allocation' (a minimal sketch is appended at the end of this issue).

@ptrendx @andrei5055 @szha @leezu
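For reference, here is a minimal standalone sketch of that shared dummy 'info' buffer idea. The helper name `SharedDummyInfo` is hypothetical, not existing MXNet code:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (name and placement illustrative only): one device int,
// allocated once and intentionally never freed, shared by every call site
// that needs somewhere for a cuSOLVER/cuBLAS 'info' output it never reads
// back.  Concurrent writes from different streams are harmless because the
// value is never inspected, and the pointer stays valid across CUDA Graph
// capture/replay since it is never reallocated.
static int* SharedDummyInfo() {
  static int* ptr = []() {
    int* p = nullptr;
    cudaMalloc(&p, sizeof(int));  // leaked deliberately for process lifetime
    return p;
  }();
  return ptr;
}
```

Call sites such as the cusolverDn*potrf invocations could then pass `SharedDummyInfo()` as the devInfo argument instead of allocating and immediately freeing a one-int buffer per call.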
