guberti opened a new pull request, #12448:
URL: https://github.com/apache/tvm/pull/12448

   Currently, our microTVM implementation of `depthwise_conv2d` uses the 
fallback schedule, and performance is consequently terrible. This change adds a 
schedule for certain cases of `depthwise_conv2d` when it is run on a Cortex-M4 
or Cortex-M7 based chip (though I mainly thought about the M4). Almost all of the 
"big" optimizations have been implemented, which should make our 
implementation faster than TFLite Micro and comparable to CMSIS-NN:
   - Performs 4x fewer memory loads than the fallback implementation by loading 
four `int8` values from the kernel and input tensor at a time. This is the main 
source of our speedup.
   - Uses a hand-written assembly micro-kernel utilizing the `__SMLAD` 
instruction to compute convolutions for four channels at once.
   - Uses a specialized kernel packing to remove four assembly instructions 
from the micro kernel. 
   - When `stride>1`, pads the kernel asymmetrically to slightly reduce the 
size of the padded tensor.
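
   To make the first two bullets concrete, here is a minimal portable C sketch 
of the general idea: fetch four `int8` values with one 32-bit load, then feed 
sign-extended halfword pairs to a dual multiply-accumulate. It is not the 
micro kernel from this PR. The helper names (`load_4x_int8`, `sxtb16_model`, 
`smlad_model`, `dot4_int8`) are hypothetical, the functions only model what the 
`SXTB16` and `SMLAD` instructions compute, and the real kernel additionally 
rearranges data across channels with halfword packs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch four int8 values with a single 32-bit read. */
static uint32_t load_4x_int8(const int8_t *p) {
    uint32_t word;
    memcpy(&word, p, sizeof word); /* typically compiles to one word load */
    return word;
}

/* Portable model of SXTB16: sign-extend bytes 0 and 2 into two 16-bit lanes.
 * Relies on two's-complement narrowing casts, as implemented by GCC and Clang. */
static uint32_t sxtb16_model(uint32_t x) {
    uint32_t lo = (uint16_t)(int16_t)(int8_t)(x & 0xFF);
    uint32_t hi = (uint16_t)(int16_t)(int8_t)((x >> 16) & 0xFF);
    return lo | (hi << 16);
}

/* Portable model of SMLAD: acc + lo(a)*lo(b) + hi(a)*hi(b) on signed lanes. */
static int32_t smlad_model(uint32_t a, uint32_t b, int32_t acc) {
    int16_t a_lo = (int16_t)(a & 0xFFFF), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFF), b_hi = (int16_t)(b >> 16);
    return acc + a_lo * b_lo + a_hi * b_hi;
}

/* Four int8 multiply-accumulates using two word loads and two dual MACs. */
static int32_t dot4_int8(const int8_t *x, const int8_t *w, int32_t acc) {
    uint32_t xw = load_4x_int8(x);
    uint32_t ww = load_4x_int8(w);
    /* Bytes 0 and 2 of each word. */
    acc = smlad_model(sxtb16_model(xw), sxtb16_model(ww), acc);
    /* Bytes 1 and 3: shift them into the positions of bytes 0 and 2 first. */
    acc = smlad_model(sxtb16_model(xw >> 8), sxtb16_model(ww >> 8), acc);
    return acc;
}
```

   A byte-at-a-time loop would issue eight loads for the same four products, 
while this version issues two, which is where the 4x reduction in memory 
traffic comes from.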
   
   However, in the interest of getting a PR merged, I did not implement a few other 
optimizations. The most important gap is that this schedule is not autotunable 
in any meaningful way (besides reordering a few loops). In an ideal world, we 
would use custom knobs to allow reordering of the instructions inside 
`QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP` (e.g. do we load the kernel from memory 
first, or perform halfword packs on our input tensor first?). This would 
improve performance on the M4 by a little bit, but I suspect it would improve M7 
performance a lot.
   
   Additionally, I would have liked to handle the edges of the convolution with 
strip mining, instead of by padding the input tensor. This padding requires 
copying the entire tensor, and is therefore slow, but support for strip mining 
in TVM is pretty bad. A few other desired improvements:
   - Custom knobs for reordering instructions in the micro kernel
   - Replace tensor padding with strip mining or something else
   - Use a specialized version of `QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP` for 
the entry in kernels with an odd number of entries (e.g. 3x3 kernels)
   - Generalize the micro kernel to support kernel sizes beyond 3x3
   - Similar to the above, remove other restrictions on the use of this micro 
kernel (e.g. support kernel dilation)
   - Allow requantization and ReLU instructions to be fused in a way that's not 
slow (next on my TODO list)
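
   As a rough illustration of the padding trade-off described above, the 
sketch below handles the borders of a 1-D, 3-tap, stride-1 convolution by 
peeling the edge outputs out of the main loop, so the input is never copied 
into a padded buffer. This is only an assumption about what edge handling 
without padding could look like. The function name `conv1d_3tap_no_pad` is 
hypothetical, and the code is not part of this PR or of TVM.

```c
#include <stddef.h>
#include <stdint.h>

/* 1-D "same" convolution with 3 taps and stride 1. Out-of-range inputs are
 * treated as zero by skipping the corresponding tap at each border. */
static void conv1d_3tap_no_pad(const int8_t *in, size_t n,
                               const int8_t w[3], int32_t *out) {
    if (n == 0) return;
    if (n == 1) { out[0] = in[0] * w[1]; return; }

    /* Left edge: the tap that would read in[-1] is skipped. */
    out[0] = in[0] * w[1] + in[1] * w[2];

    /* Interior: every tap is in range, so no bounds checks are needed. */
    for (size_t i = 1; i + 1 < n; ++i) {
        out[i] = in[i - 1] * w[0] + in[i] * w[1] + in[i + 1] * w[2];
    }

    /* Right edge: the tap that would read in[n] is skipped. */
    out[n - 1] = in[n - 2] * w[0] + in[n - 1] * w[1];
}
```

   The interior loop is the part a fast micro kernel would cover. The edge 
cases cost extra code, but they avoid the full-tensor copy that padding needs.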
   
   I'm marking this PR as a draft for now, since there currently aren't any 
tests. Thanks to @areusch for his help with explaining how TVM does scheduling!

