Lunderberg commented on code in PR #12720:
URL: https://github.com/apache/tvm/pull/12720#discussion_r971016392

##########
python/tvm/tir/tensor_intrin/cuda.py:
##########
@@ -36,7 +36,7 @@ def shared_16x32_to_ldmatrix_32x16_layout(i, j):
 def shared_32x16_to_ldmatrix_32x16_layout(i, j):
-    thread_id = (i % 4) + 4 * (j % 8)
+    thread_id = (i % 16) // 4 + 4 * (j % 8)

Review Comment:
Thank you for looking into it! I wasn't able to find any tests that explicitly validate the transform (e.g. use the transform to generate data in a specific layout, then pass it through the mma), as all the tests either started with transformed data, only used the 16x16 shape, or replaced everything with the tensor intrinsic. I had put together [this standalone test](https://gist.github.com/Lunderberg/0c2a44de34e7e2a1d149c37b2a112f91) to convince myself of it.

The main issue with the current index map is that it doesn't map to unique locations (512 input indices map to 128 output indices). It only surfaced in this PR because the PR generates the inverse of the index map in order to determine whether and where padding is required.
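
The uniqueness claim can be checked without TVM by enumerating every `(i, j)` in the 32x16 input and counting distinct outputs. The following is a minimal sketch, not the linked gist. Only the `thread_id` expressions come from the diff; the second returned component (`local_id`) is not shown in the hunk, so the expression used here is an assumption about the rest of the function, and the names `old_layout`, `new_layout`, and `count_unique` are made up for illustration.

```python
from itertools import product


def old_layout(i, j):
    # thread_id as it was before this change (the "-" line of the diff).
    thread_id = (i % 4) + 4 * (j % 8)
    # ASSUMPTION: not part of the diff hunk; assumed form of the second component.
    local_id = 8 * (j // 8) + (i // 16) * 4 + i % 4
    return thread_id, local_id


def new_layout(i, j):
    # thread_id as proposed in this change (the "+" line of the diff).
    thread_id = (i % 16) // 4 + 4 * (j % 8)
    # ASSUMPTION: same assumed second component as in old_layout.
    local_id = 8 * (j // 8) + (i // 16) * 4 + i % 4
    return thread_id, local_id


def count_unique(layout, rows=32, cols=16):
    """Count the distinct output locations over the full rows x cols input."""
    return len({layout(i, j) for i, j in product(range(rows), range(cols))})


if __name__ == "__main__":
    print("old:", count_unique(old_layout))  # 128 of 512 under the assumption above
    print("new:", count_unique(new_layout))  # 512 of 512 under the assumption above
```

Under that assumption, the old map never uses `(i % 16) // 4`, so four input rows collapse onto each output location. This is why its inverse is not well defined, while the new map covers all 32x16 output locations exactly once.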
