fxmarty commented on PR #15111: URL: https://github.com/apache/tvm/pull/15111#issuecomment-1614601193
Thank you. I'm trying to connect the dots between the differing opinions I hear about using the `mma.sync.aligned` op for GEMV (which, from my understanding of CUTLASS, is what the FasterTransformer extension comes down to) versus not using it, and your results are in line with my current intuition. The only remaining mystery to me is why PyTorch chooses a tensor-core-based kernel for fp16 × fp16 GEMV.

Maybe there is an opportunity for a CUTLASS extension based instead on https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/gemv.h / https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/gemv_batched_strided.h for the decoding, along with a GEMM kernel for the prefill.
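For context on why decoding and prefill might want separate kernels: in prefill the activation is a matrix, while in single-token decoding it collapses to one row, turning the matmul into a GEMV. A minimal NumPy sketch of the shape difference (the dimensions here are illustrative assumptions, not taken from this PR):

```python
import numpy as np

# Hypothetical weight for one transformer linear layer.
hidden, out_features = 4096, 4096
W = np.random.randn(out_features, hidden).astype(np.float16)

# Prefill: the whole prompt is processed at once, so the activation is a
# (seq_len x hidden) matrix and the op is a GEMM.
prefill_x = np.random.randn(512, hidden).astype(np.float16)
prefill_y = prefill_x @ W.T  # shape (512, out_features): GEMM

# Decoding: one new token per step, so the activation degenerates to a
# single row and the op is effectively a GEMV (matrix-vector product).
decode_x = np.random.randn(1, hidden).astype(np.float16)
decode_y = decode_x @ W.T  # shape (1, out_features): GEMV
```

The decoding case is memory-bandwidth-bound (each weight is read once per token), which is why a dedicated GEMV kernel rather than a tensor-core GEMM path can make sense there.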
