anijain2305 opened a new pull request #6115: URL: https://github.com/apache/incubator-tvm/pull/6115
Using MKL for quantized dense, following the MKL fallback for FP32 dense. On a C5.12xlarge instance (Cascade Lake, with VNNI support), results for BERT base are as follows (latency in ms):

Type | Batch size | MXNet+MKL | TVM+MKL
-- | -- | -- | --
FP32 | 128 | 33.56 | 16.83
Quantized | 128 | 23.94697 | 17.59

The overhead of TVM quantized relative to TVM FP32 comes from the fact that only the Dense ops in the network are quantized, so each quantized Dense pays the cost of a back-and-forth quantize and dequantize. We will investigate whether quantize/dequantize can be improved.

@icemelon9
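For context, the quantize/dequantize round trip mentioned above can be sketched as follows. This is an illustrative, self-contained example (plain Python, not TVM or MKL code): a symmetric per-tensor int8 quantization around a dense op, with int32 accumulation, showing the extra conversion steps that surround each quantized Dense when the rest of the graph stays FP32. All names and the tiny data here are made up for illustration.

```python
# Illustrative sketch (not TVM/MKL code): symmetric per-tensor int8
# quantization (zero_point = 0) around a dense (matmul) op.

def quantize(x, scale):
    """FP32 -> int8 with a per-tensor scale; clamp to the int8 range."""
    return [max(-128, min(127, round(v / scale))) for v in x]

def dequantize(x, scale):
    """int8/int32 -> FP32."""
    return [v * scale for v in x]

def qdense(a_q, w_q, a_scale, w_scale):
    """int8 x int8 dense, accumulating in int32, then rescaling to FP32.
    a_q: length-K activation row; w_q: N rows of length-K weights."""
    acc = [sum(a * w for a, w in zip(a_q, row)) for row in w_q]  # int32 accum
    return dequantize(acc, a_scale * w_scale)

# Toy data (hypothetical values).
activations = [0.5, -1.2, 2.0]
weights = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]

# Scales chosen so the max-magnitude value maps to 127.
a_scale, w_scale = 2.0 / 127, 0.6 / 127

a_q = quantize(activations, a_scale)              # extra op vs. pure FP32
w_q = [quantize(row, w_scale) for row in weights]  # done once, offline
out = qdense(a_q, w_q, a_scale, w_scale)           # includes dequantize

# FP32 reference: out should closely approximate this.
ref = [sum(a * w for a, w in zip(activations, row)) for row in weights]
```

The quantize on the activations and the dequantize on the output are the per-Dense costs referred to above; when only Dense is quantized, every such op pays them at the FP32/int8 boundary.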
