masahi opened a new pull request #9402:
URL: https://github.com/apache/tvm/pull/9402


   Adds two major quality-of-life improvements for applying CUTLASS BYOC to 
real-world models like `bert-large`:
   
   * Cache profiling results so that we don't have to repeat profiling for a 
`(M, N, K)` shape that has already been profiled.
   * Use a new feature in CUDA 11.2 to compile all generated kernels in 
parallel. `bert-large` ends up generating hundreds of kernels after profiling, 
and all of them are compiled by a single invocation of `nvcc`. Without this 
feature, compilation is painfully slow.  
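
   A minimal sketch of the two ideas above, not the code in this PR. 
`ProfileCache`, `fake_profile`, `nvcc_command`, and the cache file name are 
hypothetical names made up for illustration. The sketch assumes the CUDA 11.2 
feature in question is the `nvcc` `--threads` option, which was added in that 
release; the command is only built here, never executed.

```python
import json
import os
import tempfile


class ProfileCache:
    """Hypothetical on-disk cache: GEMM shape (M, N, K) -> best kernel name."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)

    @staticmethod
    def _key(m, n, k):
        # JSON object keys must be strings, so encode the shape as "M,N,K".
        return f"{m},{n},{k}"

    def lookup_or_profile(self, m, n, k, profile_fn):
        """Return the cached result, running profile_fn only on a cache miss."""
        key = self._key(m, n, k)
        if key not in self.entries:
            self.entries[key] = profile_fn(m, n, k)
            with open(self.path, "w") as f:
                json.dump(self.entries, f)
        return self.entries[key]


def nvcc_command(source, output, cuda_version, num_threads):
    """Build (but do not run) an nvcc command line for one source file.

    Assumption: --threads is only passed when cuda_version >= (11, 2).
    """
    cmd = ["nvcc", "--shared", "-Xcompiler", "-fPIC", "-o", output, source]
    if cuda_version >= (11, 2):
        cmd += ["--threads", str(num_threads)]
    return cmd


calls = []


def fake_profile(m, n, k):
    # Stand-in for a real profiler run; records each invocation.
    calls.append((m, n, k))
    return f"gemm_kernel_for_{m}x{n}x{k}"


cache_path = os.path.join(tempfile.mkdtemp(), "cutlass_profile_cache.json")
cache = ProfileCache(cache_path)
first = cache.lookup_or_profile(384, 1024, 1024, fake_profile)
second = cache.lookup_or_profile(384, 1024, 1024, fake_profile)
# A fresh cache object reads the saved file instead of profiling again.
reloaded = ProfileCache(cache_path).lookup_or_profile(384, 1024, 1024, fake_profile)

new_cmd = nvcc_command("kernels.cu", "kernels.so", (11, 2), 8)
old_cmd = nvcc_command("kernels.cu", "kernels.so", (11, 1), 8)
```

   Keying on the shape alone is a simplification: a real cache would likely 
also need to distinguish the target GPU architecture and data types, since the 
best kernel for a given `(M, N, K)` can differ between them.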



