MasterJH5574 opened a new pull request, #17236: URL: https://github.com/apache/tvm/pull/17236
This PR bumps FlashInfer and updates PagedKVCache accordingly for performance improvement. Some notes on this bump:

* When the Grouped-Query Attention group size is at least 4 and FlashInfer is enabled, we use the prefill attention kernel for better performance.
* We enlarge the temporary workspace for FlashInfer accordingly, since the current FlashInfer version may require a much larger workspace. We skip allocating this workspace when FlashInfer is not enabled.
* We reduce the maximum block depth to 1, since cascade inference provides limited benefit when the batch size is not large and prompt reuse is low.
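The first two heuristics above can be sketched as follows. This is a hypothetical illustration, not the actual TVM/PagedKVCache code: the function names, kernel labels, and the 128 MiB workspace size are all assumptions made for the example.

```python
def select_attention_kernel(flashinfer_enabled: bool,
                            group_size: int,
                            is_decode: bool) -> str:
    """Choose an attention kernel under the assumed heuristic."""
    if flashinfer_enabled and group_size >= 4:
        # The prefill kernel performs better for grouped-query attention
        # once the group size reaches 4, even on decode steps.
        return "flashinfer_prefill"
    return "decode" if is_decode else "prefill"


def flashinfer_workspace_bytes(flashinfer_enabled: bool) -> int:
    """Size the temporary workspace: enlarged when FlashInfer is on,
    zero (skipped) otherwise."""
    if not flashinfer_enabled:
        return 0  # no FlashInfer, no workspace allocation
    return 128 * 1024 * 1024  # assumed enlarged size; the real value differs
```

For example, a decode step with group size 8 and FlashInfer enabled would be routed to the prefill kernel under this heuristic, while the same step without FlashInfer would use the plain decode kernel.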
