MasterJH5574 opened a new pull request, #16692: URL: https://github.com/apache/tvm/pull/16692
This PR enhances PagedKVCache with the copy stream separation. In detail, for CUDA and ROCm backend, we create a standalone copy stream for the copy of auxiliary data structure from CPU to GPU. Furthermore, we move the copy from BeginForward to Attention, which means it's no longer eagerly executed, instead, becoming lazily executed when Attention computation is needed. By making these changes, we are able to overlap the auxiliary data copy time (on the copy stream) with the model forward computation that happens before the first Attention. As a result, we can hide some of the copy latency. This PR also bumps the version of FlashInfer for the copy stream support. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
