MasterJH5574 commented on PR #15373: URL: https://github.com/apache/tvm/pull/15373#issuecomment-1645027161
> LGTM! I was curious if there's any performance implication of this change?

@junrushao I didn't measure. For platforms like CUDA with a warp size of 32, the additional shared memory holds at most 16 elements, and I assume this overhead is negligible. Moreover, in multi-warp reduction settings, the current implementation, which leverages warp-level primitives, will be at least as fast as the status before #15327, which allocated a large shared-memory buffer and ran a naive cross-thread reduction over it. On the other hand, to fulfill the semantics of allreduce, we have to accept this small shared-memory cost.
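For readers unfamiliar with the pattern under discussion, here is a minimal Python simulation (not TVM or CUDA code; all names are illustrative) of the two-stage allreduce shape: an intra-warp tree reduction using warp-level primitives, followed by a small shared-memory pass over the per-warp partials. The per-warp partial buffer is the small extra shared-memory allocation referred to above.

```python
WARP_SIZE = 32  # assumed CUDA warp size

def warp_reduce(vals):
    """Simulate a shuffle-down tree reduction within one warp.

    On real hardware this would use warp-level primitives such as
    __shfl_down_sync, with no shared memory involved.
    """
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]  # lane 0 holds the warp's partial sum

def block_allreduce(values):
    """Two-stage sum across one thread block's worth of values."""
    assert len(values) % WARP_SIZE == 0
    num_warps = len(values) // WARP_SIZE
    # Stage 1: each warp reduces its own WARP_SIZE elements in registers.
    # Stage 2: one partial per warp is staged through a small shared
    # buffer (num_warps elements -- the modest extra shared memory
    # mentioned above), then reduced to the final result.
    shared = [warp_reduce(values[w * WARP_SIZE:(w + 1) * WARP_SIZE])
              for w in range(num_warps)]
    return sum(shared)

print(block_allreduce(list(range(128))))  # sum of 0..127 = 8128
```

The contrast with the pre-#15327 behavior is that the naive scheme would stage all `len(values)` elements through shared memory, whereas here only `num_warps` partials touch it.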
