FrozenGene commented on issue #4828: [QNN][TFLite] TFLite rounding mode support
URL: https://github.com/apache/incubator-tvm/pull/4828#issuecomment-611288188
 
 
   > Actually, Requantize is the bottleneck in my observation on Rasp Pi 4. For 
TFLite quantized mobilenetv1, currently with master + auto-tuning, I get 56 ms 
(TFLite is 35 ms).  Upon deeper analysis, I found requantize to be the 
bottleneck (even with UPWARD rounding which is the cheapest). The reason is 
that calculations happen in int64 and either ARM does not have good 
instructions or LLVM does not pick up the right instructions.
   > 
   > 
   > 
   > As an experiment, I forced the datatype of Requantize to be int32. This 
will lead to bad accuracy, but it gives us an idea of the minimum latency of 
the Requantize op. The runtime reduced from the earlier 56 ms to 36 ms 
(almost as good as TFLite). So, requantize is the bottleneck.
   > 
   > 
   > 
   > Additionally, we should note that TFLite rounding is not a standard 
rounding. They have worked backwards from the ARM instructions they want to use. 
Because of that they have 2 roundings instead of 1 - one in 
`SaturatingRoundingDoublingHighMul` (non-standard rounding) and the other in 
`RoundingDivideByPOT` (standard). Given there are 2 roundings, they have 
also made an accuracy-performance tradeoff. So, my suggestion is that we should 
think carefully about whether it makes sense to completely follow TFLite rounding 
as the golden reference. For example, consider the following reference function:
   > 
   > ~~~
   > // This function implements the same computation as the ARMv7 NEON VQRDMULH
   > // instruction.
   > template <>
   > inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
   >                                                       std::int32_t b) {
   >   bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
   >   std::int64_t a_64(a);
   >   std::int64_t b_64(b);
   >   std::int64_t ab_64 = a_64 * b_64;
   >   std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
   >   std::int32_t ab_x2_high32 =
   >       static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
   >   return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
   > }
   > ~~~
   > 
   > 
   > 
   > This function alone needs around 10 Relay ops (if not more), but in TFLite 
this is just 1 ARM instruction - VQRDMULH. I doubt that LLVM will be able to 
use that instruction automatically.
   > 
   > 
   > 
   > In summary, I understand the benefits of an exact match between TFLite and 
TVM. But in contrast to FP32 networks, this is one of the cases where 
frameworks have made a choice to trade off performance and accuracy for a specific 
ARM device. Note that with non-TFLite rounding, the application accuracy is 
still similar, so it might be an acceptable tradeoff. Therefore, I would 
encourage us to think more deeply. I worry that the extensive work going into this 
PR might need major reconsideration if we don't see good performance 
improvements.
   > 
   > 
   > 
   > If we relax the exact tensor match criteria, there might be other ways to 
get the performance, like using tensorize.
   > 
   >  
   > 
   > @u99127 @masahi might also be interested in this.
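(For reference, the "standard" rounding mentioned above, gemmlowp's `RoundingDivideByPOT`, is a rounding arithmetic right shift. A scalar sketch follows - the production path uses NEON instructions, so this is only the reference form:)

```cpp
#include <cstdint>

// Scalar sketch of gemmlowp's RoundingDivideByPOT: arithmetic right shift
// by `exponent`, with ties rounded half away from zero. The threshold is
// bumped by 1 for negative inputs so that e.g. -5 >> 1 rounds to -3, not -2.
inline std::int32_t RoundingDivideByPOT(std::int32_t x, int exponent) {
  const std::int32_t mask = (std::int32_t{1} << exponent) - 1;
  const std::int32_t remainder = x & mask;
  const std::int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
  return (x >> exponent) + (remainder > threshold ? 1 : 0);
}
```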
   
   Please note [this 
thread](https://discuss.tvm.ai/t/supporting-bit-exact-tflite-qnn-inference/5528): 
if we don't have bit-exact results, we will have many problems, not just a 
tradeoff. The Mobilenet accuracy result alone is not enough.
   
   You could also see one 
[blog](https://github.com/apache/incubator-tvm-site/pull/2) about how to boost the 
quantized performance. In it, we got bit-exact results using the same 
rounding as TFLite, but we could still beat TFLite performance by a lot. However, 
the design is a little different: we designed one q_conv2d / q_requantize to 
control it. For q_conv2d, we tensorize it and do the requantize inside it. For 
depthwise convolution, we did not even tensorize and just used q_conv2d + 
q_requantize, and we could still beat TFLite and QNNPACK by a lot. One problem 
of our current design is that we lower to too many ops, so we cannot make sure 
all computation stays in registers, and NCHW has bad cache behavior, as the 
previous blog said.
   
   We should make sure of functionality first, then consider performance. 
I think it is not only the ARM instruction you mention but also many other 
things we need to do.
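To make concrete what bit-exact TFLite requantize rounding involves, here is a minimal scalar sketch putting the two helpers together. The helpers follow the gemmlowp reference forms quoted above; the composed function's name follows TFLite's reference `MultiplyByQuantizedMultiplier`. This is a scalar sketch for clarity, not the NEON path either implementation actually runs:

```cpp
#include <cstdint>
#include <limits>

// Scalar form of gemmlowp's SaturatingRoundingDoublingHighMul: the high
// 32 bits of 2*a*b with rounding, saturating the single overflow case
// (a == b == INT32_MIN), matching ARMv7 NEON VQRDMULH.
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

// Scalar form of gemmlowp's RoundingDivideByPOT: arithmetic right shift
// with ties rounded half away from zero (repeated here so the sketch is
// self-contained).
inline std::int32_t RoundingDivideByPOT(std::int32_t x, int exponent) {
  const std::int32_t mask = (std::int32_t{1} << exponent) - 1;
  const std::int32_t remainder = x & mask;
  const std::int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
  return (x >> exponent) + (remainder > threshold ? 1 : 0);
}

// The composition used in requantize: multiply by a fixed-point multiplier,
// then shift right - hence the two roundings discussed above.
inline std::int32_t MultiplyByQuantizedMultiplier(std::int32_t x,
                                                  std::int32_t multiplier,
                                                  int right_shift) {
  return RoundingDivideByPOT(SaturatingRoundingDoublingHighMul(x, multiplier),
                             right_shift);
}
```

For example, a multiplier of `1 << 30` represents the real factor 0.5, so with `right_shift = 1` the overall requantize scale is 0.25 and `MultiplyByQuantizedMultiplier(100, 1 << 30, 1)` yields 25. Matching this bit-for-bit in Relay is what takes the ~10 ops mentioned in the quoted comment.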

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
