giuseros edited a comment on pull request #5754: URL: https://github.com/apache/incubator-tvm/pull/5754#issuecomment-642520722
Hi @FrozenGene,

Thanks a lot for your comments. I will address the general points here, and the code comments in a separate reply.

* I did read your discuss [post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4), but I thought that work was orthogonal to this one. My main goal here is to have a fast general convolution algorithm for Armv8-A, whereas your post talks about mobilenet v2 and Raspi 3.
* In mobilenet v2 there are no deep convolutional layers, mostly depthwise convolutions and 1x1 convolutions. With shallow convolutions the problem becomes memory bound, and the differences among the algorithms become less evident. That is also why I picked inception_v3, where there are 1x1, 3x3, 5x5, 1x7 and 7x1 convolutions.
* Raspi 3 comes with a 32-bit operating system, which means using Armv7-A. The problem with Armv7-A is that instead of having 32 registers (as in Armv8-A) you have only 16, so the optimization space is reduced. Also, I think (but I am not 100% sure) that the TFLite developers do not optimize heavily for Armv7-A. Indeed, on Armv7-A @anijain2305 shows (in the same post you mention) a 0.80 ratio for tflite/tvm, while I see a 0.60/0.30 ratio for the multi/single thread scenarios, respectively.
* The Qnnpack post you mention explicitly says that: "the microkernel that leverages the dual issue capability proves to be 15 percent to 20 percent faster for a sufficiently large channel count (K > 64)"
* The way they do convolution (and gemm) in Qnnpack for Armv8-A is by using a combination of `smlal` and `smlal2` (plus a combination of `usubl` and `usubl2`), while `conv2d_nhwc_spatial_pack` only uses `smlal`. It is true that on Armv7 they only use `vmlal` (and `vsubl`). So I wonder whether the autoscheduler (which I am not familiar with) is able to generate such combinations for Armv8.
* I did not try CPUs other than the Cortex-A76. The point is that I am not using anything specific to that CPU, only things specific to the Armv8-A ISA.
* I agree that for smaller convolutions (or depthwise convolutions) there are simpler algorithms that work as well (or are even faster). I also agree with stacking multiple strategies and letting TVM select the best.

I will reply on the code in the following comment.
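
To make the `usubl`/`usubl2` and `smlal`/`smlal2` pairing mentioned above concrete, here is a minimal pure-Python sketch of the lane semantics of those four Armv8-A instructions. It only emulates what each instruction computes, it says nothing about performance, and it is not the Qnnpack microkernel or the code in this PR. The helper `dot16` and the zero-point arguments `za` and `zw` are hypothetical names introduced for the illustration.

```python
# Pure-Python emulation of four AArch64 NEON widening instructions.
# Vectors are plain lists of ints; only the lane semantics are modelled.

def usubl(a, b):
    """USUBL: widen the LOW 8 of 16 uint8 lanes and subtract (16-bit lanes)."""
    return [x - y for x, y in zip(a[:8], b[:8])]

def usubl2(a, b):
    """USUBL2: same as USUBL, but on the HIGH 8 of 16 uint8 lanes."""
    return [x - y for x, y in zip(a[8:], b[8:])]

def smlal(acc, a, b):
    """SMLAL: acc[i] += a[i] * b[i] on the LOW 4 of 8 int16 lanes (32-bit acc)."""
    return [c + x * y for c, x, y in zip(acc, a[:4], b[:4])]

def smlal2(acc, a, b):
    """SMLAL2: same as SMLAL, but on the HIGH 4 of 8 int16 lanes."""
    return [c + x * y for c, x, y in zip(acc, a[4:], b[4:])]

def dot16(a, w, za, zw):
    """Hypothetical helper: sum((a[i] - za) * (w[i] - zw)) over 16 uint8 lanes.

    za and zw stand in for quantization zero points (an assumption made
    for this example only).
    """
    za_v, zw_v = [za] * 16, [zw] * 16
    acc = [0, 0, 0, 0]  # one int32x4 accumulator register
    for sub in (usubl, usubl2):      # low half, then high half of the inputs
        a16 = sub(a, za_v)           # 8 widened activation lanes
        w16 = sub(w, zw_v)           # 8 widened weight lanes
        acc = smlal(acc, a16, w16)   # consumes lanes 0..3 of a16 and w16
        acc = smlal2(acc, a16, w16)  # consumes lanes 4..7 of the same registers
    return sum(acc)                  # horizontal add of the 4 accumulator lanes

a = list(range(16))
w = [255 - i for i in range(16)]
print(dot16(a, w, za=3, zw=128))
```

The `2` variants read the upper half of the same source registers, so a full 128-bit register is consumed without a separate instruction to extract its high half. A schedule that only emits `smlal` has to get the upper lanes into place some other way.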
