masahi commented on pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#issuecomment-828042725


   This post says: "They (`ds_permute` and `ds_bpermute` instructions) use LDS 
hardware to route data between the 64 lanes of a wavefront, but they don’t 
actually write to an LDS location"
   https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/
   
   If both approaches go through the LDS (shared memory) hardware, I wonder why 
the explicit approach used in this PR is faster.
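
   For reference, here is a rough Python model of the lane routing described in the quoted post. It is only a sketch: the function names `ds_bpermute` and `ds_permute` are illustrative stand-ins for the instructions, and the model assumes 64 lanes, byte addresses (lane index times 4), zero fill for lanes that receive nothing, and last-writer-wins on conflicts. It ignores the EXEC mask and says nothing about performance.

```python
WAVEFRONT_SIZE = 64  # lanes per wavefront, as in the quoted post


def ds_bpermute(addr, src):
    """Backward permute ("pull"): lane i reads the value held by lane addr[i] // 4."""
    return [src[(addr[i] // 4) % WAVEFRONT_SIZE] for i in range(WAVEFRONT_SIZE)]


def ds_permute(addr, src):
    """Forward permute ("push"): lane i sends its value to lane addr[i] // 4."""
    dst = [0] * WAVEFRONT_SIZE  # assumption: lanes nobody writes to read back 0
    for i in range(WAVEFRONT_SIZE):
        # assumption: on a conflict, the later (higher-numbered) lane wins
        dst[(addr[i] // 4) % WAVEFRONT_SIZE] = src[i]
    return dst


# Every lane holds its own index and addresses its right-hand neighbour.
src = list(range(WAVEFRONT_SIZE))
addr = [((i + 1) % WAVEFRONT_SIZE) * 4 for i in range(WAVEFRONT_SIZE)]

pulled = ds_bpermute(addr, src)  # lane i now holds the value of lane i + 1
pushed = ds_permute(addr, src)   # lane i + 1 now holds the value of lane i
```

   In both cases the data is only routed between lanes and nothing is stored to an LDS address, which is what the explicit approach does differently.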

