MasterJH5574 commented on PR #15373:
URL: https://github.com/apache/tvm/pull/15373#issuecomment-1645027161

   > LGTM! I was curious if there's any performance implication of this change?
   
   @junrushao I didn’t measure. For platforms like CUDA with a warp size of 32, the 
additional shared memory holds at most 16 elements, and I assume this 
overhead is negligible. Nevertheless, in multi-warp reduction settings, the 
current implementation, which leverages warp-level primitives, will be at least no 
slower than the state before #15327, which allocated large shared memory and 
used a naive cross-thread reduction implementation over shared memory.
   
   On the other hand, to fulfill the semantics of allreduce, we have to accept 
this small shared-memory cost here.
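   For readers following along, here is an illustrative sketch of the two-stage pattern being discussed: warp-level shuffles reduce within each warp, and the per-warp partials pass through a small shared buffer (at most 16 elements for a 512-thread block with warp size 32). This is a hand-written sketch assuming a sum reduction, not the code TVM actually generates; the function name `block_allreduce_sum` and the 512-thread cap are assumptions for illustration.

   ```cuda
   // Sketch of a block-wide allreduce (sum) using warp shuffles plus a
   // small shared buffer for per-warp partials. Assumes blockDim.x <= 512.
   __device__ float block_allreduce_sum(float val) {
     // Stage 1: reduce within each warp using warp-level primitives.
     for (int offset = 16; offset > 0; offset >>= 1)
       val += __shfl_down_sync(0xffffffff, val, offset);

     // Stage 2: one partial per warp goes through shared memory.
     // With at most 512 threads per block, this buffer needs <= 16 slots.
     __shared__ float warp_partials[16];
     int lane = threadIdx.x & 31;   // lane index within the warp
     int warp = threadIdx.x >> 5;   // warp index within the block
     if (lane == 0) warp_partials[warp] = val;
     __syncthreads();

     // The first warp reduces the partials; the result is broadcast
     // back to every thread through shared memory to satisfy the
     // allreduce semantics (all threads observe the reduced value).
     int num_warps = (blockDim.x + 31) >> 5;
     if (warp == 0) {
       val = (lane < num_warps) ? warp_partials[lane] : 0.0f;
       for (int offset = 16; offset > 0; offset >>= 1)
         val += __shfl_down_sync(0xffffffff, val, offset);
       if (lane == 0) warp_partials[0] = val;
     }
     __syncthreads();
     return warp_partials[0];
   }
   ```

   Compared with the pre-#15327 approach, the shared allocation here shrinks from one element per thread to one element per warp.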


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
