zhuwenxi opened a new pull request #8479:
URL: https://github.com/apache/tvm/pull/8479


   This PR improves the performance of scatter_nd on the CUDA backend in two ways:

   1. Split the op into two kernels, one for the "Init" step and another for the
      "Update" step. This lets each kernel use its own Grid/Block configuration
      and better utilize the SMs.
   2. Use atomic_add instead of direct assignment, which avoids the race
      condition when multiple indices point to the same location in the output
      tensor. With this modification, it is now safe to use more CUDA threads
      to gain more parallelism. A sketch of the scheme is given after this list.
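
   The following is a minimal, hypothetical CUDA sketch of the two-kernel scheme described above. The kernel names, the `[index_rank, num_updates]` layout assumed for `indices`, and the launch parameters are illustrative assumptions, not the actual code this PR generates through TVM.

```cuda
#include <cuda_runtime.h>

// Kernel 1, "Init": copy the input data into the output buffer.
// Its launch configuration is sized to the whole output tensor.
__global__ void scatter_nd_init(const float* data, float* out, int out_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < out_size) {
        out[i] = data[i];
    }
}

// Kernel 2, "Update": scatter the update slices into the output.
// Its launch configuration is sized to num_updates * slice_size, independent
// of the Init kernel. `indices` is assumed to be laid out as
// [index_rank, num_updates]; `out_strides[d]` is the stride of output
// dimension d, measured in slices.
__global__ void scatter_nd_update(const float* updates, const int* indices,
                                  const int* out_strides, float* out,
                                  int num_updates, int slice_size,
                                  int index_rank) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_updates * slice_size) {
        return;
    }
    int update_id = i / slice_size;  // which update slice this thread handles
    int elem_id   = i % slice_size;  // offset inside that slice

    // Flatten the multi-dimensional index into a slice offset.
    int slice_offset = 0;
    for (int d = 0; d < index_rank; ++d) {
        slice_offset += indices[d * num_updates + update_id] * out_strides[d];
    }

    // atomicAdd serializes concurrent writes to the same destination, so it is
    // safe to run one thread per update element even when indices collide.
    atomicAdd(&out[slice_offset * slice_size + elem_id], updates[i]);
}
```

   Because the Init launch covers the full output while the Update launch covers only `num_updates * slice_size` elements, each kernel can pick a grid/block shape that keeps the SMs busy for its own workload instead of sharing one configuration.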
   
   Detailed discussion:
https://discuss.tvm.apache.org/t/topi-cuda-scatter-nd-has-a-very-poor-performance-on-cuda-backend-1000x-slower-than-hand-written-cuda-code/10426
   

