[GitHub] [incubator-tvm] t-vi commented on pull request #5600: [TOPI] Improve CUDA softmax scheduling
t-vi commented on pull request #5600: URL: https://github.com/apache/incubator-tvm/pull/5600#issuecomment-638823562

@wpan11nv Thanks for your offer to help. I submitted the clean-up in #5726, and then in #5727 I added ROCm warp reductions. One of the things I did was to avoid assuming a fixed warp size of 32 in the TIR transformations before codegen. Thank you for improving softmax, by the way - it looked funny with the four kernels before.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
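The fixed-warp-size issue mentioned above can be illustrated with a small CPU model of a shuffle-down warp reduction. CUDA warps have 32 lanes, while AMD GCN wavefronts have 64, so lowering that hardcodes 32 silently sums only half of a ROCm wavefront. The helper names below are hypothetical and only sketch the pattern, not TVM's actual lowering:

```python
def shfl_down(vals, delta):
    """CPU model of the GPU shuffle-down primitive: lane i reads the value
    held by lane i + delta; out-of-range reads return the lane's own value,
    matching the hardware behaviour."""
    n = len(vals)
    return [vals[i + delta] if i + delta < n else vals[i] for i in range(n)]


def warp_reduce_sum(vals, assumed_warp_size=None):
    """Tree reduction over one warp/wavefront using shuffle-down.
    `assumed_warp_size` models code that hardcodes the warp size: if it is
    smaller than the actual lane count, part of the data is never summed."""
    vals = list(vals)
    delta = (assumed_warp_size or len(vals)) // 2
    while delta >= 1:
        vals = [a + b for a, b in zip(vals, shfl_down(vals, delta))]
        delta //= 2
    return vals[0]  # lane 0 holds the reduced value


# A 64-lane wavefront reduced under a hardcoded 32-lane assumption only
# accumulates lanes 0..31 into lane 0 - the bug the warp-size fix avoids.
print(warp_reduce_sum(range(32)))                        # full 32-lane sum
print(warp_reduce_sum(range(64)))                        # full 64-lane sum
print(warp_reduce_sum(range(64), assumed_warp_size=32))  # half the data missed
```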
t-vi commented on pull request #5600: URL: https://github.com/apache/incubator-tvm/pull/5600#issuecomment-638622419

I'm adding shfl intrinsics to the ROCm bits (using `tvm.intrin.rule.rocm.tvm_warp_shuffle`/`-up`/`-down` definitions). I'm currently seeing an odd effect where I get a `tvm_thread_allreduce` call with null arguments in `lower_thread_allreduce`'s `MakeAllreduce`. Eventually I hope to get to the codegen, where I'll probably run into the nvptx bits in the LLVM codegen. Is there a reason not to use the intrin.rule mechanism for nvptx?

I'm not sure that running `gpu_imagenet_bench.py` (which I'm using as a first check of whether anything works) with the nvptx target works for me (though I do get to the codegen for that), but I wouldn't know whether it worked before.
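The intrin.rule mechanism referred to here is essentially name-based dispatch: lowering looks up a rule registered under `tvm.intrin.rule.<target>.<intrin>` and, if one exists, rewrites the call into the target-specific form. A toy model of that lookup (hypothetical helper names, not TVM's actual API; the `__shfl_down` spelling reflects HIP's mask-free shuffle intrinsic):

```python
# Toy registry modeling the "tvm.intrin.rule.<target>.<intrin>" lookup:
# codegen asks for a target-specific lowering by name and falls back to
# emitting the generic call if no rule is registered.
RULES = {}


def register_intrin_rule(target, intrin):
    """Register a lowering rule under the dotted rule name."""
    def deco(f):
        RULES[f"tvm.intrin.rule.{target}.{intrin}"] = f
        return f
    return deco


@register_intrin_rule("rocm", "tvm_warp_shuffle_down")
def _lower_rocm_shfl_down(args):
    # On ROCm/HIP, shuffle-down lowers to __shfl_down (no sync mask,
    # unlike CUDA's __shfl_down_sync).
    return f"__shfl_down({', '.join(args)})"


def lower(target, intrin, args):
    """Apply a registered rule, or keep the generic call unchanged."""
    rule = RULES.get(f"tvm.intrin.rule.{target}.{intrin}")
    return rule(args) if rule else f"{intrin}({', '.join(args)})"


print(lower("rocm", "tvm_warp_shuffle_down", ["v", "1"]))
print(lower("nvptx", "tvm_warp_shuffle_down", ["v", "1"]))  # no rule yet
```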
t-vi commented on pull request #5600: URL: https://github.com/apache/incubator-tvm/pull/5600#issuecomment-638329923

I'll just work on a fix.
t-vi commented on pull request #5600: URL: https://github.com/apache/incubator-tvm/pull/5600#issuecomment-638275567

ROCm uses the CUDA schedule, but warp reductions don't currently seem to work there (so arguably, ROCm's warp support should be improved). Before this PR one could run resnet18 with the ROCm backend; now one cannot. The same breakage also shows up earlier, when running the warp reduction tests on ROCm. I've looked a bit into fixing it, but I haven't fully understood which of the three related patches it stems from. (Incidentally, it also triggered a corner case for me on CUDA where nvrtc would accidentally use CUDA 8.0 instead of the 10.1 that the libnvrtc belonged to.)
t-vi commented on pull request #5600: URL: https://github.com/apache/incubator-tvm/pull/5600#issuecomment-638068589

This broke the ROCm backend.