masahi edited a comment on pull request #7935: URL: https://github.com/apache/tvm/pull/7935#issuecomment-828042725
This post (https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/) says: "They [the `ds_permute` and `ds_bpermute` instructions] use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location". I don't know what they mean by "route without actually writing".

If both approaches use shared memory, I wonder why the explicit approach in this PR is faster.
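For reference, here is a minimal Python sketch of the lane-routing semantics as I understand them from the linked article. It is only a software model, not how the hardware does it: the function names `permute` and `bpermute` are hypothetical, plain lane indices are used instead of the hardware's address operand, and the index list is assumed to be a permutation so that no two lanes target the same destination.

```python
WAVEFRONT_SIZE = 64  # lanes per wavefront, as stated in the quoted article


def bpermute(src, index):
    """Backward permute ("pull"): lane i reads the value held by lane index[i]."""
    return [src[index[i] % WAVEFRONT_SIZE] for i in range(WAVEFRONT_SIZE)]


def permute(src, index):
    """Forward permute ("push"): lane i sends its value to lane index[i].

    Assumption: index is a permutation of the lanes, so every destination
    is written exactly once and no conflict rule is needed.
    """
    out = [0] * WAVEFRONT_SIZE
    for i in range(WAVEFRONT_SIZE):
        out[index[i] % WAVEFRONT_SIZE] = src[i]
    return out


lanes = list(range(WAVEFRONT_SIZE))                  # lane i holds the value i
rotate = [(i + 1) % WAVEFRONT_SIZE for i in lanes]   # each lane names its right neighbour

pulled = bpermute(lanes, rotate)  # lane i receives the value of lane i + 1
pushed = permute(lanes, rotate)   # lane i + 1 receives the value of lane i
```

With the same index list the two operations are inverses of each other: pushing the result of a pull (or the reverse) gives back the original lane values.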
