masahi commented on pull request #7233: URL: https://github.com/apache/tvm/pull/7233#issuecomment-757016060
Yes, there are 4 calls to 4D scatter in MaskRCNN. The old kernel took 11.6 milliseconds on them in total, making it one of the bottlenecks shown in the profile above. This change brings that down to 1.9873 milliseconds in total (roughly a 5.8x speedup), so scatter is no longer a bottleneck. This is a solid improvement. I think the old kernel was slow for this input shape (1000, 256, 7, 7) because the thread block is too small (32, 1, 1) and we launch too many of them (1000 * 256 * 7 blocks).
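To put the numbers in the comment above in perspective, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic on the figures quoted in the comment. The helper name `launch_stats` is hypothetical, and treating the block count as a 3-tuple grid `(1000, 256, 7)` is an assumption made for illustration, not something taken from the TVM kernel source.

```python
from math import prod


def launch_stats(grid, block):
    """Return (num_blocks, threads_per_block, total_threads) for a launch shape.

    `grid` and `block` are tuples of extents; this is a hypothetical helper
    used only to illustrate the arithmetic in the comment.
    """
    num_blocks = prod(grid)
    threads_per_block = prod(block)
    return num_blocks, threads_per_block, num_blocks * threads_per_block


# Old configuration as described in the comment:
# 1000 * 256 * 7 blocks, each with a (32, 1, 1) thread block.
# The (1000, 256, 7) grid layout is an assumption; only the product is quoted.
old_blocks, old_tpb, old_threads = launch_stats((1000, 256, 7), (32, 1, 1))

# Speedup implied by the two timings quoted in the comment (milliseconds).
old_ms, new_ms = 11.6, 1.9873
speedup = old_ms / new_ms

print(f"blocks launched:   {old_blocks:,}")
print(f"threads per block: {old_tpb}")
print(f"total threads:     {old_threads:,}")
print(f"speedup:           {speedup:.1f}x")
```

Under these assumptions the old launch amounts to about 1.8 million blocks of 32 threads each, which is consistent with the guess that per-block launch overhead, rather than the work per thread, dominated the old kernel's runtime.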
