comaniac commented on pull request #9261: URL: https://github.com/apache/tvm/pull/9261#issuecomment-942488532
Ah, sorry, I didn't make it clear. The interface of the C source codegen does handle dynamic workloads, because it takes raw pointers whose underlying buffers can be any size at runtime. What I meant was how to generate CUTLASS kernels that perform well across all shapes.

In @Laurawly's post, they generate lots of kernels to cover possible shapes, which results in a 7GB binary. I assume they also generate runtime dispatching logic (also in the generated C source code) to determine which kernel should be used given the shape known at runtime. Binary size will clearly be an issue for this solution.

The JSON codegen/runtime would be similar to TensorRT: we simply dump a JSON graph in codegen without doing anything else. Meanwhile, we have a custom runtime that JITs/caches CUTLASS kernels based on the known shapes. This results in a much smaller binary, but the first execution (or an execution with new shapes) may take several seconds or even a minute to JIT all kernels.
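To make the JIT-and-cache idea concrete, here is a minimal Python sketch of a shape-keyed kernel cache. It is not the TVM or CUTLASS API: `KernelCache`, `fake_compile`, and the `(M, N, K)` key are hypothetical names chosen for illustration, and the "compile" step is a plain Python stand-in for what would be an expensive kernel build in a real runtime.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: the (M, N, K) dimensions of a GEMM workload.
Shape = Tuple[int, int, int]


class KernelCache:
    """Compile a kernel the first time a shape is seen, then reuse it."""

    def __init__(self, compile_fn: Callable[[Shape], Callable]):
        self._compile_fn = compile_fn  # stand-in for an expensive JIT step
        self._kernels: Dict[Shape, Callable] = {}
        self.compile_count = 0  # how many times we paid the JIT cost

    def get(self, shape: Shape) -> Callable:
        kernel = self._kernels.get(shape)
        if kernel is None:
            # Cache miss: this is where the first-run latency comes from.
            kernel = self._compile_fn(shape)
            self._kernels[shape] = kernel
            self.compile_count += 1
        return kernel


def fake_compile(shape: Shape) -> Callable:
    """Placeholder for a real kernel build; returns a naive matmul."""
    m, n, k = shape

    def kernel(a, b):
        return [
            [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)
        ]

    return kernel


cache = KernelCache(fake_compile)
first = cache.get((2, 2, 2))   # compiles
second = cache.get((2, 2, 2))  # served from the cache
```

Only shapes that actually occur at runtime are ever built, which is why the binary stays small, while each new shape pays the compile cost once.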
