masahi edited a comment on pull request #7137: URL: https://github.com/apache/tvm/pull/7137#issuecomment-750960526
ok, I ran a CPU perf comparison of the old code vs. the new code + default param. As expected, there is almost no perf regression.

```
Old code: 2.1364135856199997 sec
After the fix in this PR: 2.1756499899999997 sec
```

The old code spends 5 milliseconds on NMS. Here is the stat from the profile: https://github.com/masahi/torchscript-to-tvm/blob/master/maskrcnn/maskrcnn_cpu_vm_profile_old.txt#L51

The new code gets far more boxes, but NMS spends only 20 milliseconds on them: https://github.com/masahi/torchscript-to-tvm/blob/master/maskrcnn/maskrcnn_cpu_vm_profile_1000.txt#L24

The reason the new code doesn't make CPU perf worse is simple. Even though NMS now gets more boxes, PyTorch detection models do a post-NMS top-k after the RPN, see https://github.com/pytorch/vision/blob/90645ccd0e774ad76200245e32222a23d09f2312/torchvision/models/detection/rpn.py#L261-L263

So no matter how many boxes NMS receives, the number of boxes after NMS + post-NMS top-k doesn't change: it is bounded by the `rpn_post_nms_top_n_test` parameter, whose default is 1000. The CPU perf is always dominated by the large dense layer.

Moreover, I'd like to reiterate that the new code is more correct and gives essentially the same output as PyTorch. At this point, if you are still not happy with this fix, I'm genuinely curious why. I suggest looking into the details of the models you are compiling and benchmarking carefully before making baseless claims.
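To illustrate why the post-NMS top-k bounds the output size regardless of how many boxes NMS receives, here is a small self-contained sketch in plain Python. This is not the torchvision implementation; `nms_then_topk`, the greedy `nms` helper, and the 0.7 IoU threshold are hypothetical names/values chosen for illustration only.

```python
def iou(a, b):
    # Intersection-over-union of two boxes in (x1, y1, x2, y2) form.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7):
    # Greedy NMS: walk boxes in descending score order, keep a box only if it
    # does not overlap (above the threshold) any already-kept box.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

def nms_then_topk(boxes, scores, k):
    # Post-NMS top-k: however many boxes went into NMS, at most k indices
    # come out, because `kept` (sorted by descending score) is truncated.
    kept = nms(boxes, scores)
    return kept[:k]
```

Feeding NMS 300 boxes or 3000 boxes makes NMS itself do more work, but the downstream layers only ever see at most `k` boxes, which is why the overall time stays dominated by the dense layer.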
