paleolimbot opened a new pull request, #509: URL: https://github.com/apache/arrow-nanoarrow/pull/509
This PR implements asynchronous buffer copying when copying CUDA buffers. Before this, we had basically been issuing `cuMemcpyDtoH()`/`cuMemcpyHtoD()` many times in a row, with a synchronize up front and a synchronize at the end, which was probably not great for performance. Additionally, when copying String/Binary/Large String/Large Binary arrays from CUDA to the CPU, we were issuing very tiny copies on the offsets buffer and synchronizing with the CPU just to learn the number of bytes to copy for the data buffer.

After this PR, when copying from CPU to CUDA, we can return before the copy has necessarily completed by setting the output `sync_event`. When copying from CUDA to CPU, the copy is done in one pass if there are no string/binary arrays, or two passes if there are. When copying string/binary arrays, the implementation walks the entire tree of arrays and issues asynchronous copies for the last offset value of each. Then the stream is synchronized with the CPU, and a second set of asynchronous copies is issued for the buffers whose sizes we now know.

I don't have much experience with CUDA async programming, so I don't know whether this approach could be simplified (e.g., I do this in two streams, but one stream might be sufficient, since all of the device -> host copies may be getting queued against each other regardless of which stream they are on). This will be easier to test with bigger, non-trivial data when it is wired up to Python.

TODO:

- Implement `sync_event` integration (both for source and destination)
- Test more than just a few string arrays

Closes #245.
