Laurawly opened a new pull request #5339: [TOPI] Improve get_valid_count and nms performance for CUDA
URL: https://github.com/apache/incubator-tvm/pull/5339

In this PR, I update the object detection ops `get_valid_count` and `nms` by removing data rearrangement on the GPU.

`get_valid_count` performs two computations: counting the number of valid elements, and rearranging the valid elements to the front of the array while marking the invalid elements as -1. By removing the data rearrangement and moving it into the `argsort` in `nms`, we get a 266x speedup for the `get_valid_count` op in the `ssd_resnet50_v1` model.

I also remove data rearrangement for `nms` on the GPU. Although this adds extra work to `argsort` in `nms`, we get another 7x speedup by removing the unnecessary data rearrangement on the GPU.

Note that with `get_valid_count` changed, the old topi/relay tests won't work, but the end-to-end object detection accuracy doesn't drop.

Here's a performance comparison for these two ops in `ssd_resnet50_v1` with input size (1, 3, 512, 512), with Thrust turned on in the build.

| Operator | Time (ms) w/o this PR | Time (ms) w/ this PR | Speedup |
| --- | --- | --- | --- |
| `get_valid_count` | 10631.1 | 38.325 | 266 |
| `non_maximum_supression` | 6060.23 | 852.702 | 7 |

Also, in this PR, I fix a minor bug in deformable conv2d.

@icemelon9 @kevinthesun @vinx13 please review.
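
For readers unfamiliar with the ops, here is a minimal NumPy sketch of the idea described in this PR. It is not the CUDA implementation. It assumes each box is laid out as `[class_id, score, x1, y1, x2, y2]`, that a box is valid when its score exceeds a threshold, and that valid scores are always greater than -1. The function names `count_valid_in_place` and `order_by_score` are hypothetical.

```python
import numpy as np


def count_valid_in_place(data, score_threshold, score_index=1):
    """Count valid boxes and mark invalid ones as -1 without moving any box.

    data has shape (batch, num_anchors, elem_len). The layout and the
    "score > threshold" validity rule are assumptions for this sketch.
    """
    valid = data[:, :, score_index] > score_threshold
    valid_count = valid.sum(axis=1).astype(np.int32)
    # Invalid rows are overwritten with -1 but keep their original position,
    # so no compaction (data rearrangement) pass is needed here.
    marked = np.where(valid[:, :, None], data, -1.0).astype(data.dtype)
    return valid_count, marked


def order_by_score(marked, score_index=1):
    """Sort indices by descending score.

    Because invalid rows carry a score of -1, the sort itself pushes them
    to the end, which is the work the old rearrangement step used to do.
    """
    return np.argsort(-marked[:, :, score_index], axis=1, kind="stable")


# One batch with three boxes; the middle box falls below the threshold.
boxes = np.array(
    [[[0, 0.9, 0, 0, 10, 10],
      [0, 0.1, 1, 1, 5, 5],
      [1, 0.7, 2, 2, 8, 8]]],
    dtype=np.float32,
)
valid_count, marked = count_valid_in_place(boxes, score_threshold=0.5)
order = order_by_score(marked)
```

In this sketch the first `valid_count[b]` entries of `order[b]` index the valid boxes of batch `b`, so a later suppression loop only needs to visit that prefix.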
