Hi @FrozenGene ,
Thanks a lot for your comments. I will address the general points here, and the code comments in a separate reply.

* I did indeed read your discuss [post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4), but I think that work is orthogonal to this one. My main goal here is a fast general convolution algorithm for armv8, while your post focuses on mobilenet v2 on a Raspberry Pi 3.
* In mobilenet v2 there are no deep convolutional layers: it is mostly depthwise convolutions and 1x1 convolutions. With shallow convolutions the problem becomes memory bound, and the differences among the algorithms become less evident (see the back-of-the-envelope calculation after this list). That is also why I picked inception_v3, which has 1x1, 3x3, 5x5, 1x7, and 7x1 convolutions.
* The Raspberry Pi 3 runs a 32-bit operating system, which means armv7. The problem with armv7 is that instead of 32 vector registers (as in armv8) you have only 16, so the optimization space is reduced. Also, I believe (though I am not 100% sure) that the TFLite developers do not optimize as aggressively for armv7. Indeed, on armv7 @anijain2305 reports (in the same post you mention) a 0.80 tflite/tvm ratio, while I see 0.60/0.30 ratios in the multi-/single-threaded scenarios, respectively.
* The Qnnpack post you mention explicitly says: "the microkernel that leverages the dual issue capability proves to be 15 percent to 20 percent faster for a sufficiently large channel count (K > 64)".
* The way they do convolution (and gemm) in Qnnpack for armv8 is with a combination of `smlal` and `smlal2` (plus a combination of `usubl` and `usubl2`), while `conv2d_nhwc_spatial_pack` only uses `smlal` (see the intrinsics sketch below). It is true that on armv7 they only use `vmlal` (and `vsubl`). So I wonder whether the autoscheduler (which I am not familiar with) is able to generate such combinations for armv8.
* I did not try any CPUs other than the A76. The point is that I am not using anything specific to that CPU, only things specific to the armv8 ISA.
* I agree that for smaller convolutions (or depthwise convolutions) there are simpler algorithms that work just as well (or are even faster). I also agree with stacking multiple strategies and letting TVM select the best one.
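To make the memory-bound point concrete, here is a rough, illustrative arithmetic-intensity estimate. The shapes and the `intensity` helper are hypothetical, picked only to show the contrast between a shallow 1x1 layer and a deep 3x3 layer:

```c
#include <stdio.h>

/* Back-of-the-envelope arithmetic intensity (MACs per byte moved)
 * for a convolution layer, assuming int8 (1-byte) elements and that
 * input, weights, and output each cross memory once. */
static double intensity(int h, int w, int cin, int cout, int kh, int kw) {
    double macs  = (double)h * w * cout * cin * kh * kw;  /* multiply-accumulates */
    double bytes = (double)h * w * cin                    /* input feature map    */
                 + (double)kh * kw * cin * cout           /* weights              */
                 + (double)h * w * cout;                  /* output feature map   */
    return macs / bytes;
}

int main(void) {
    printf("1x1,  32ch: %6.1f MACs/byte\n", intensity(56, 56,  32,  32, 1, 1));
    printf("3x3, 256ch: %6.1f MACs/byte\n", intensity(14, 14, 256, 256, 3, 3));
    return 0;
}
```

The shallow 1x1 layer does only on the order of 16 MACs per byte moved, so it is dominated by memory traffic and a cleverer compute schedule buys little; the deep 3x3 layer does roughly an order of magnitude more work per byte, which is where the choice of algorithm shows up.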

I will reply on the code in a following comment.
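In the meantime, to make the `smlal`/`smlal2` point above concrete, here is a minimal C intrinsics sketch of that Qnnpack-style pattern. It is not the actual kernel from this PR or from Qnnpack; `dot_step` and its signature are made up for illustration. `vsubl_u8`/`vsubl_high_u8` lower to `usubl`/`usubl2`, and `vmlal_s16`/`vmlal_high_s16` lower to `smlal`/`smlal2`:

```c
#include <arm_neon.h>

/* Hypothetical armv8 (AArch64-only) inner step for quantized gemm:
 * usubl/usubl2 widen the uint8 inputs (minus their zero points) to
 * int16, then smlal/smlal2 multiply-accumulate the low/high halves
 * into int32 lanes.  Pairing the low/high variants is what exposes
 * the dual-issue opportunity on armv8 cores. */
static inline int32x4_t dot_step(int32x4_t acc,
                                 uint8x16_t a, uint8_t a_zp,
                                 uint8x16_t b, uint8_t b_zp) {
    uint8x16_t za = vdupq_n_u8(a_zp), zb = vdupq_n_u8(b_zp);

    /* usubl / usubl2: subtract the zero point and widen to int16
     * (the uint16 wraparound reinterpreted as int16 is the correct
     * signed difference). */
    int16x8_t a_lo = vreinterpretq_s16_u16(vsubl_u8(vget_low_u8(a), vget_low_u8(za)));
    int16x8_t a_hi = vreinterpretq_s16_u16(vsubl_high_u8(a, za));
    int16x8_t b_lo = vreinterpretq_s16_u16(vsubl_u8(vget_low_u8(b), vget_low_u8(zb)));
    int16x8_t b_hi = vreinterpretq_s16_u16(vsubl_high_u8(b, zb));

    /* smlal / smlal2: multiply-accumulate the int16 halves into int32.
     * A final vaddvq_s32(acc) would reduce the lanes to a scalar. */
    acc = vmlal_s16(acc, vget_low_s16(a_lo), vget_low_s16(b_lo));
    acc = vmlal_high_s16(acc, a_lo, b_lo);
    acc = vmlal_s16(acc, vget_low_s16(a_hi), vget_low_s16(b_hi));
    acc = vmlal_high_s16(acc, a_hi, b_hi);
    return acc;
}
```

On armv7 only the low-half forms (`vmlal.s16`, `vsubl.u8`) exist, which is why the dual-issue trick, and the `_high_` intrinsics above, are armv8-specific.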
