Hzfengsy opened a new pull request, #13534:
URL: https://github.com/apache/tvm/pull/13534

   This PR introduces a new pass, UnifyKernelLaunch, that fuses neighboring
kernels with the same thread configuration, which reduces the kernel launch
overhead of the `split` operator.
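
To illustrate the idea (not the actual implementation in this PR), here is a minimal sketch in plain Python. It assumes a kernel can be modeled as a launch configuration plus a list of statements, and that kernels with equal configurations have no dependencies that would prevent merging. The `Kernel` class, the `fuse_adjacent_kernels` helper, and the `(1, 75)` launch configuration are hypothetical names and values chosen for this example; none of them are TVM APIs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Kernel:
    # Hypothetical stand-in for one GPU kernel launch; not a TVM data structure.
    launch_config: Tuple[int, int]  # (number of blocks, threads per block)
    body: List[str]                 # statements executed by the kernel


def fuse_adjacent_kernels(kernels: List[Kernel]) -> List[Kernel]:
    """Merge each run of neighboring kernels that share a launch config."""
    fused: List[Kernel] = []
    for kernel in kernels:
        if fused and fused[-1].launch_config == kernel.launch_config:
            # Same thread config as the previous kernel: append the body
            # to it instead of paying for another launch.
            fused[-1].body.extend(kernel.body)
        else:
            # Copy the body so the input kernels are never mutated.
            fused.append(Kernel(kernel.launch_config, list(kernel.body)))
    return fused


# 182 one-column copies that all share one (assumed) launch config.
split_kernels = [Kernel((1, 75), [f"B{i}[x] = A[x, {i}]"]) for i in range(182)]
fused_kernels = fuse_adjacent_kernels(split_kernels)
```

In this toy model the 182 separate launches collapse into a single kernel whose body contains all 182 copy statements. The launch-overhead saving comes from that collapse.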
   
   Tested with the following setup:
   
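   ```python
   from tvm import te, topi

   # Split a (75, 182) int64 tensor into 182 single-column outputs.
   A = te.placeholder((75, 182), "int64")
   B = topi.split(A, 182, axis=1)
   
   # Build one PrimFunc that takes A and all split outputs as parameters.
   func = te.create_prim_func([A, *B])
   ```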
   
   On my machine, the pass reduces the latency from 347.03 us to 5.22 us.
   
   cc @MasterJH5574 @vinx13 @junrushao 

