MasterJH5574 opened a new pull request, #16750: URL: https://github.com/apache/tvm/pull/16750
This PR introduces CUDA IPC memory support in the TVM runtime. IPC memory allows multiple distributed workers to access each other's GPU memory directly, which is helpful for implementing customized communication primitives across distributed workers. In this PR, we bring the customized all-reduce implementation from TensorRT-LLM into 3rdparty; this implementation makes use of CUDA IPC memory. We expose the all-reduce function as a global function under the namespace `tvm::runtime::disco::cuda_ipc`. One unit test for the customized all-reduce kernel over two workers is added.

---

Co-authored-by: Hongyi Jin <[email protected]>

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected]. For queries about this service, please contact Infrastructure at: [email protected]
