Hzfengsy opened a new pull request, #13534: URL: https://github.com/apache/tvm/pull/13534
This PR introduces a new pass, `UnifyKernelLaunch`, that fuses nearby kernels sharing the same thread config. This reduces the kernel launch overhead in the `split` operator.

Tested with the following setup:

```python
from tvm import te, topi

A = te.placeholder((75, 182), "int64")
B = topi.split(A, 182, axis=1)
func = te.create_prim_func([A, *B])
```

With the pass, latency drops from 347.03 us to 5.22 us on my machine.

cc @MasterJH5574 @vinx13 @junrushao
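To illustrate the idea of fusing kernels with the same thread config, here is a toy model in plain Python. It is only a sketch and not the implementation in this PR: the `Kernel` class, the `fuse_adjacent` helper, and the `(1, 75)` thread config are hypothetical, and "nearby" is assumed to mean strictly consecutive launches.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Kernel:
    """Hypothetical stand-in for a single kernel launch."""
    name: str
    thread_config: tuple  # assumed layout: (grid_dim, block_dim)


def fuse_adjacent(kernels):
    """Merge runs of consecutive kernels that share a thread config."""
    fused = []
    # groupby only groups consecutive items with equal keys, so kernels
    # separated by a different config are not merged together.
    for config, group in groupby(kernels, key=lambda k: k.thread_config):
        names = [k.name for k in group]
        fused.append(Kernel("+".join(names), config))
    return fused


# One kernel per split output, all with the same (assumed) thread config.
launches = [Kernel(f"split_{i}", (1, 75)) for i in range(182)]
fused = fuse_adjacent(launches)
print(len(launches), "->", len(fused))
```

In this model the 182 launches from the `split` example collapse into a single launch, which is the kind of reduction in launch count that would explain the latency drop reported above.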
