masahi edited a comment on pull request #7257: URL: https://github.com/apache/tvm/pull/7257#issuecomment-758901561
I don't quite follow, maybe you are missing something? First, this PR doesn't change our NMS API, it only changes the buffer layouts used internally. Second, the final concat is only required for MXNet, which uses `return_indices=False`. Our NMS returns something completely different depending on `return_indices` flag : If True, it returns a big output tensor packed with bboxes, scores and class ids, with invalid entries indicated by -1. (The valid entries are supposed to move to the top, if ` invalid_to_bottom` flag is True. But our GPU NMS kernel ignores this argument and the output is not reordered. This is another difference with CPU implementation, I think this is a bug) https://github.com/apache/tvm/blob/4e8cc4fc26e931e38017d198d29f45cba04f5a60/python/tvm/topi/cuda/nms.py#L762-L763 If `return_indices=True`, which applies to TF, ONNX, and PyTorch, we only return survived box indices, so there is no need to concat bboxes, scores, and class ids. For this case, there is zero additional overhead after this PR, it is just that concat before NMS done by frontends are now completely pointless. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
