[GitHub] [incubator-tvm] giuseros edited a comment on pull request #5754: [RFC] Improve quantized convolution performance for armv8 architectures

GitBox Thu, 11 Jun 2020 03:05:01 -0700


giuseros edited a comment on pull request #5754:
URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642520722



   Hi @FrozenGene ,
   Thanks a lot for your comments. I will give my general replies here and 
answer the code comments in a separate reply.
   
   * I did read your Discuss 
[post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4),
 but I considered that work orthogonal to this one. My main goal here is to 
have a fast general convolution algorithm for Armv8-A. Your post is about 
MobileNet V2 and the Raspberry Pi 3. 
   * MobileNet V2 has no deep convolutional layers; it consists mostly of 
depthwise convolutions and 1x1 convolutions. With shallow convolutions the 
problem becomes memory bound, and the differences among the algorithms become 
less evident. That is also why I picked inception_v3, which contains 1x1, 3x3, 
5x5, 1x7 and 7x1 convolutions. 
   * The Raspberry Pi 3 comes with a 32-bit operating system, which means 
using Armv7-A. The problem with Armv7-A is that you have only 16 registers 
instead of the 32 available in Armv8-A, so the optimization space is reduced. 
Also, I think (but I am not 100% sure) that the TFLite developers do not 
optimize heavily for Armv7-A. Indeed, on Armv7-A @anijain2305 shows (in the 
same post you mention) a 0.80 ratio for tflite/tvm, while I see a 0.60/0.30 
ratio for the multi/single thread scenarios, respectively. 
   * The Qnnpack post you mention explicitly says that: "the microkernel that 
leverages the dual issue capability proves to be 15 percent to 20 percent 
faster for a sufficiently large channel count (K > 64)"
   * The way they do convolution (and gemm) in Qnnpack for Armv8-A is by using 
a combination of `smlal` and `smlal2` (plus a combination of `usubl` and 
`usubl2`), while `conv2d_nhwc_spatial_pack` only uses `smlal`. It is true that 
on Armv7-A they only use `vmlal` (and `vsubl`). So I wonder whether the 
autoscheduler (which I am not familiar with) is able to generate such 
combinations for Armv8-A. 
   * I did not try CPUs other than the Cortex-A76. The point is that I am not 
using anything specific to that CPU, only features of the Armv8-A ISA. 
 
   * I agree that for smaller convolutions (or depthwise convolutions) there 
are simpler algorithms that work as well (or are even faster). I also agree 
with stacking multiple strategies and letting TVM select the best one. 
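
To make the memory-bound argument concrete, here is a rough sketch that estimates the arithmetic intensity (multiply-accumulates per byte moved) of an int8 convolution. It assumes stride 1, "same" padding, and that every tensor is read or written exactly once. The function name and the layer shapes are illustrative assumptions, not measurements from this PR.

```python
def conv_arithmetic_intensity(h, w, c_in, c_out, kh, kw, depthwise=False):
    """Estimate MACs per byte for an int8 convolution.

    Assumptions (illustrative only): stride 1, 'same' padding, one byte per
    element, and each tensor touched exactly once (no cache effects).
    """
    if depthwise:
        # Each output channel only reads its own input channel.
        macs = h * w * c_in * kh * kw
        weight_bytes = kh * kw * c_in
        out_bytes = h * w * c_in
    else:
        macs = h * w * c_out * kh * kw * c_in
        weight_bytes = kh * kw * c_in * c_out
        out_bytes = h * w * c_out
    in_bytes = h * w * c_in
    return macs / (in_bytes + weight_bytes + out_bytes)


# Hypothetical layer shapes, chosen only to show the trend.
depthwise_3x3 = conv_arithmetic_intensity(56, 56, 144, 144, 3, 3, depthwise=True)
pointwise_1x1 = conv_arithmetic_intensity(56, 56, 144, 24, 1, 1)
dense_3x3 = conv_arithmetic_intensity(35, 35, 192, 192, 3, 3)

for name, value in [("depthwise 3x3", depthwise_3x3),
                    ("pointwise 1x1", pointwise_1x1),
                    ("dense 3x3", dense_3x3)]:
    print(f"{name}: {value:.1f} MACs/byte")
```

Under these assumptions the depthwise and 1x1 layers do far less arithmetic per byte than the dense 3x3 layer, which is why the choice of compute kernel matters less for them.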
   
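
As a reading aid for the `usubl`/`usubl2` and `smlal`/`smlal2` pairing mentioned above, the following pure-Python sketch emulates the lane semantics of those four instructions on 16 `uint8` inputs. It is not Qnnpack's kernel and not the code in this PR; the helper names mirror the mnemonics, `quantized_dot16` is a hypothetical name, and integer wraparound is ignored because the values used here stay in range.

```python
def usubl(vn, vm):
    """Widening unsigned subtract of the LOW 8 byte lanes -> 8 x 16-bit lanes."""
    return [vn[i] - vm[i] for i in range(8)]


def usubl2(vn, vm):
    """Same as usubl, but on the HIGH 8 byte lanes of the 16-byte vectors."""
    return [vn[i] - vm[i] for i in range(8, 16)]


def smlal(acc, vn, vm):
    """acc (4 x 32-bit) += product of the LOW 4 halfword lanes of vn and vm."""
    return [acc[i] + vn[i] * vm[i] for i in range(4)]


def smlal2(acc, vn, vm):
    """acc (4 x 32-bit) += product of the HIGH 4 halfword lanes of vn and vm."""
    return [acc[i] + vn[i + 4] * vm[i + 4] for i in range(4)]


def quantized_dot16(a, b, a_zero, b_zero):
    """Dot product of 16 uint8 pairs after subtracting the zero points."""
    za, zb = [a_zero] * 16, [b_zero] * 16
    acc_lo, acc_hi = [0] * 4, [0] * 4
    for widen in (usubl, usubl2):
        a16, b16 = widen(a, za), widen(b, zb)
        # Two independent accumulators: low and high halves are consumed
        # without any extra shuffle of the 16-bit vector.
        acc_lo = smlal(acc_lo, a16, b16)
        acc_hi = smlal2(acc_hi, a16, b16)
    return sum(acc_lo) + sum(acc_hi)


a = list(range(16))
b = [255 - 3 * i for i in range(16)]
reference = sum((x - 7) * (y - 128) for x, y in zip(a, b))
print(quantized_dot16(a, b, 7, 128) == reference)
```

The point of the pairing is that the `2` variants read the upper half of a register directly, so both halves of each widened vector feed separate accumulators.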
   I will reply to the code comments in the following comment. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
