huochaitiantang opened a new pull request #7814:
URL: https://github.com/apache/tvm/pull/7814


   We submit this PR to add quantization support for the [Vision Transformer 
(ViT)](https://arxiv.org/abs/2010.11929) model on GPU. The main changes are as 
follows:
   
   1. In the ViT model, the most time-consuming operators are `batch_matmul`, so we first 
add the compute and schedule of `batch_matmul_int8.cuda` in tvm.topi.cuda
   
   2. To support the quantization of batch_matmul, we then add 
`batch_matmul_rewrite` and `BatchMatmulRealize` in tvm.relay.quantize
   
   3. The KL-divergence calibration could not preserve the accuracy of the ViT model 
well, so we add the `_percentile_scale` function
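   To make the first change concrete, here is a minimal NumPy sketch of the *numerics* an int8 `batch_matmul` must reproduce (symmetric per-tensor quantization, int32 accumulation, one dequantize multiply). It is not the TOPI compute/schedule itself, and the helper names (`quantize_int8`, `batch_matmul_int8_ref`) and the 0.05 scales are made up for illustration:

   ```python
   import numpy as np

   def quantize_int8(x, scale):
       # Symmetric per-tensor quantization: round, then clip to the int8 range.
       return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

   def batch_matmul_int8_ref(a, b, scale_a, scale_b):
       """Reference semantics for an int8 batched matmul.

       a: (batch, M, K) float32, b: (batch, N, K) float32 -- following
       TVM's batch_matmul convention of transposing b's last two axes.
       """
       qa = quantize_int8(a, scale_a)
       qb = quantize_int8(b, scale_b)
       # Accumulate in int32 so the K-dimension reduction cannot overflow,
       # matching what an int8 GPU kernel (e.g. dp4a-based) would do.
       acc = np.einsum("bmk,bnk->bmn", qa.astype(np.int32), qb.astype(np.int32))
       # Dequantize with a single multiply by the product of input scales.
       return acc.astype(np.float32) * (scale_a * scale_b)

   np.random.seed(0)
   a = np.random.randn(2, 4, 8).astype("float32")
   b = np.random.randn(2, 5, 8).astype("float32")
   out = batch_matmul_int8_ref(a, b, scale_a=0.05, scale_b=0.05)
   ref = np.einsum("bmk,bnk->bmn", a, b)  # fp32 reference result
   ```

   The quantized result should track the fp32 result up to a small rounding error, which is what the int8 CUDA kernel trades for the latency win reported below.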
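   For the third change, a percentile-based calibration can be sketched as follows. This is only an illustration of the idea (clip the extreme tail of the activation distribution instead of minimizing KL divergence); the function name and the 99.99 default are assumptions, not the PR's actual `_percentile_scale` signature:

   ```python
   import numpy as np

   def percentile_scale(samples, percentile=99.99, num_bits=8):
       # Hypothetical sketch: pick the clipping threshold as a high percentile
       # of the absolute activation values, then map that threshold to the
       # int8 maximum. Outliers beyond the threshold saturate instead of
       # stretching the scale (which is what hurts KL calibration on ViT).
       threshold = np.percentile(np.abs(samples), percentile)
       qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
       return threshold / qmax

   np.random.seed(0)
   acts = np.random.randn(100000).astype("float32")  # stand-in activations
   scale = percentile_scale(acts)
   ```

   With this scale, at most ~0.01% of calibration samples saturate, which can preserve accuracy better than KL on distributions with long tails.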
   
   For the ViT-B32-224 model, the performance is as follows:
   
   - Top-1 accuracy on the ImageNet validation set
     - paper: 73.38
     - nonofficial-model-fp32: 73.27
     - nonofficial-model-int8: 72.78
   
   - Latency on a GTX 1660 GPU
     - nonofficial-model-fp32: 10.32 ms
     - nonofficial-model-int8: 4.93 ms
   
   Thanks for your review! @jcf94 @tqchen


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
