mbrookhart commented on pull request #7303: URL: https://github.com/apache/tvm/pull/7303#issuecomment-762969945
Scan is probably the most hand-optimized kernel in Thrust, so I'm thrilled to be within 10x for a cross-GPU kernel. Overall I'm happy with this, but I have two thoughts:

1. Should we add the TIR inclusive scan back in? I have that on a branch from my first implementation of get_valid_counts: https://github.com/mbrookhart/tvm/commit/944ee3c62d3176e86d555c85097c45c88d082204
2. We should probably generalize for rank. I think maybe we can use the same kind of before/after trick used in sort: https://github.com/apache/tvm/blob/f91b51d638874973a2d9ccbcb4d49cf7c668f516/python/tvm/topi/cuda/sort.py#L69-L85
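
To make both points concrete, here is a rough reference sketch in plain Python, not TIR and not the code from this PR. It assumes the "before/after trick" means collapsing every dimension ahead of the scan axis into one product and every dimension behind it into another, so a tensor of any rank can be scanned as a 3-D `(before, scan_len, after)` problem over a row-major buffer. The function name `scan_along_axis` and its `inclusive` flag are hypothetical, chosen only for illustration.

```python
from math import prod


def scan_along_axis(flat, shape, axis, inclusive=False):
    """Cumulative sum along `axis` of a row-major buffer `flat` with shape `shape`.

    Hypothetical helper for illustration; not part of TVM.
    """
    if axis < 0:
        axis += len(shape)
    before = prod(shape[:axis])      # product of the dims ahead of the scan axis
    scan_len = shape[axis]           # length of the axis being scanned
    after = prod(shape[axis + 1:])   # product of the dims behind the scan axis

    out = [0] * len(flat)
    # Each (i, k) pair is one independent 1-D scan of length scan_len.
    for i in range(before):
        for k in range(after):
            running = 0
            for j in range(scan_len):
                idx = (i * scan_len + j) * after + k
                if inclusive:
                    running += flat[idx]
                    out[idx] = running
                else:
                    out[idx] = running
                    running += flat[idx]
    return out


data = [1, 2, 3, 4, 5, 6]  # a 2x3 tensor stored row-major
print(scan_along_axis(data, (2, 3), axis=1))                  # exclusive scan
print(scan_along_axis(data, (2, 3), axis=1, inclusive=True))  # inclusive scan
print(scan_along_axis(data, (2, 3), axis=0))                  # scan over rows
```

On a GPU, each `(i, k)` pair would presumably map to its own independent scan, which is why collapsing the shape this way keeps the kernel rank-agnostic. The inclusive and exclusive variants differ only in whether the running total is written before or after the current element is added.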
