huochaitiantang opened a new pull request #7814: URL: https://github.com/apache/tvm/pull/7814
We submit this PR to add quantization support for the [vision transformer (ViT)](https://arxiv.org/abs/2010.11929) model on GPU. The main changes are as follows:

1. In the ViT model, the most time-consuming operators are batch_matmul, so we first add the compute and schedule of `batch_matmul_int8.cuda` in tvm.topi.cuda.
2. To support quantization of batch_matmul, we add `batch_matmul_rewrite` and `BatchMatmulRealize` in tvm.relay.quantize.
3. KL-divergence calibration does not preserve the accuracy of the ViT model well, so we add a `_percentile_scale` function.

For the ViT-B32-224 model, the results are as follows:

- Top-1 accuracy on the ImageNet validation set
  - paper: 73.38
  - unofficial model, fp32: 73.27
  - unofficial model, int8: 72.78
- Latency on a GTX 1660 GPU
  - unofficial model, fp32: 10.32 ms
  - unofficial model, int8: 4.93 ms

Thanks for your review! @jcf94 @tqchen
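To illustrate the idea behind percentile calibration (point 3 above), here is a minimal NumPy sketch, not the actual `_percentile_scale` implementation in this PR: instead of searching for a clipping threshold via KL divergence, it simply clips activations at a high percentile of their absolute values and derives the int8 scale from that threshold. The function name, signature, and default percentile below are illustrative assumptions.

```python
import numpy as np

def percentile_scale(samples, percentile=99.99, num_bits=8):
    """Pick a quantization scale from a high percentile of |x|.

    samples: list of calibration activation arrays (any shapes).
    Returns a scale such that values at the chosen percentile map
    to the signed int range, e.g. 127 for 8-bit quantization.
    (Hypothetical sketch; not the TVM implementation.)
    """
    abs_vals = np.abs(np.concatenate([s.ravel() for s in samples]))
    threshold = np.percentile(abs_vals, percentile)
    # Map the clipping threshold onto the max signed integer value.
    return threshold / (2 ** (num_bits - 1) - 1)

# Example: calibrate on random "activations".
rng = np.random.default_rng(0)
acts = [rng.standard_normal(1024).astype("float32") for _ in range(8)]
scale = percentile_scale(acts)
```

Compared with KL-divergence calibration, a fixed high percentile is less sensitive to the long, thin tails of post-softmax and GELU activations in transformer blocks, which is consistent with the accuracy improvement reported here.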
