masahi opened a new pull request #9402: URL: https://github.com/apache/tvm/pull/9402
Adds major QOL improvements for applying CUTLASS BYOC to real-world models like `bert-large`:

* Cache profiling results so that we don't have to repeat profiling for the same `(M, N, K)`.
* Use a new feature in CUDA 11.2 to compile all generated kernels in parallel. `bert-large` ends up generating hundreds of kernels after profiling, and all of them are compiled by a single invocation of `nvcc`. Without this feature, compilation is too slow and painful.
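The caching idea can be illustrated with a minimal sketch. This is not the code in the PR: `ProfileCache`, `lookup_or_profile`, and `fake_profile` are hypothetical names, and the stand-in profiler only returns a placeholder kernel name instead of benchmarking real CUTLASS kernels. It only shows that a GEMM shape `(M, N, K)` seen more than once is profiled a single time.

```python
class ProfileCache:
    """Hypothetical in-memory cache: GEMM shape (M, N, K) -> selected kernel name."""

    def __init__(self):
        self.table = {}

    def lookup_or_profile(self, M, N, K, profile_fn):
        # Run the (expensive) profiler only the first time a shape is seen.
        key = (M, N, K)
        if key not in self.table:
            self.table[key] = profile_fn(M, N, K)
        return self.table[key]


# Stand-in for a real profiler; records each call so repeats are visible.
calls = []


def fake_profile(M, N, K):
    calls.append((M, N, K))
    return f"gemm_kernel_for_{M}x{N}x{K}"


cache = ProfileCache()
# Assumed example shapes; the first and third are identical.
shapes = [(384, 1024, 1024), (384, 4096, 1024), (384, 1024, 1024)]
selected = [cache.lookup_or_profile(M, N, K, fake_profile) for (M, N, K) in shapes]

print(len(calls))  # number of times the profiler actually ran
```

In a model like `bert-large`, many layers share the same GEMM shape, so a lookup of this kind would skip most repeated profiling runs. Persisting the table to disk between runs would be a natural extension, but that is an assumption here, not something the description above states.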
