psrivas2 opened a new pull request, #14465: URL: https://github.com/apache/tvm/pull/14465
This PR reduces cutlass compilation time by generating and compiling a single CSourceModule instead of creating and compiling one per kernel. Creating and compiling a new CSourceModule for every function is slow, and the cost adds up quickly for models with many functions offloaded to cutlass. Instead, we generate one CSourceModule and compile it once to produce a single `runtime::Module`. For large models such as SD Unet, this cuts cutlass compilation time from ~30 min to ~4 min. Other large models show similar improvements.

#### Testing

`tests/python/relax/test_codegen_cutlass.py::test_matmul_offload` is broken at HEAD. All other tests pass with this PR when run locally.

cc @masahi
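
For readers unfamiliar with the change, here is a rough sketch of the idea in plain Python. It does not use any TVM or cutlass API: `CountingCompiler`, `compile_per_kernel`, and `compile_merged` are hypothetical names made up for illustration, and the "compiler" only counts how often it is invoked. It is meant to show why merging the generated sources reduces the number of expensive compile steps, not how the PR implements it.

```python
from typing import Callable, Dict, List


class CountingCompiler:
    """Stand-in for an expensive compiler invocation (hypothetical, not a TVM API)."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, source: str) -> str:
        # Each call models one full compile of a source module.
        self.calls += 1
        return f"module<{len(source)} chars>"


def compile_per_kernel(kernels: Dict[str, str], compile_fn: Callable[[str], str]) -> List[str]:
    """Previous behaviour as described: one source module and one compile per kernel."""
    return [compile_fn(src) for src in kernels.values()]


def compile_merged(kernels: Dict[str, str], compile_fn: Callable[[str], str]) -> str:
    """Behaviour after the change as described: join all sources, compile once."""
    merged_source = "\n\n".join(kernels.values())
    return compile_fn(merged_source)


# Five placeholder kernels standing in for functions offloaded to cutlass.
kernels = {f"kernel_{i}": f"void kernel_{i}() {{ /* generated body */ }}" for i in range(5)}

per_kernel = CountingCompiler()
compile_per_kernel(kernels, per_kernel)

merged = CountingCompiler()
compile_merged(kernels, merged)

print(per_kernel.calls, merged.calls)  # prints "5 1"
```

With N offloaded functions, the per-kernel path pays the fixed cost of a compiler invocation N times, while the merged path pays it once. That is consistent with the speedup reported above, though the sketch makes no claim about where the time actually goes inside the real build.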
