zhuwenxi opened a new pull request #8479:
URL: https://github.com/apache/tvm/pull/8479
1. Split the computation into two kernels: one performs the "Init" and the
other performs the "Update". The two kernels can then use different
Grid/Block configurations to make better use of the SMs.
2. Use atomic_add instead of direct assignment, which avoids the race
condition when multiple indices point to the same location of the output
tensor. With this modification, it is now safe to use more CUDA threads
to gain more parallelism.
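
The two changes above can be modeled sequentially. This is a minimal sketch rather than the code in this PR: the function name `scatter_nd_reference` is hypothetical, the output is assumed to be 1-D, "Init" is assumed to copy starting values into the output, and "Update" is assumed to accumulate values that target the same location.

```python
def scatter_nd_reference(data, indices, updates):
    """Sequential model of an Init phase followed by an Update phase.

    data:    starting values of the output (assumption: 1-D list)
    indices: target position in the output for each update
    updates: values to accumulate at those positions
    """
    # Phase 1 (stands in for the "Init" kernel): fill the output buffer.
    out = list(data)

    # Phase 2 (stands in for the "Update" kernel): each iteration plays the
    # role of one CUDA thread. On a GPU the += would need to be an atomic
    # add, because two threads may hold the same index.
    for idx, val in zip(indices, updates):
        out[idx] += val
    return out


# Index 1 appears twice, so both of its updates must land in the result.
result = scatter_nd_reference([0, 0, 0, 0], [1, 3, 1], [10, 20, 30])
print(result)  # [0, 40, 0, 20]
```

If the Update phase ran in parallel with a plain read-modify-write, the two updates to index 1 could overwrite each other and leave 10 or 30 there instead of 40. That is the race the atomic add is meant to remove.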
Detailed discussion:
https://discuss.tvm.apache.org/t/topi-cuda-scatter-nd-has-a-very-poor-performance-on-cuda-backend-1000x-slower-than-hand-written-cuda-code/10426