guberti opened a new pull request, #12448: URL: https://github.com/apache/tvm/pull/12448
Currently, our microTVM implementation of `depthwise_conv2d` uses the fallback schedule, and performance is consequently terrible. This change adds a schedule for certain cases of `depthwise_conv2d` when it is run on a Cortex-M4 or Cortex-M7 based chip (though I mainly thought about the M4). Almost all of the "big" performance speedups have been implemented, which should make our implementation faster than TFLite Micro and comparable to CMSIS-NN:

- Performs 4x fewer memory loads than the fallback implementation by loading four `int8` values from the kernel and input tensor at a time. This is the main source of our speedup.
- Uses a hand-written assembly micro-kernel utilizing the `__SMLAD` instruction to compute convolutions for four channels at once.
- Uses a specialized kernel packing to remove four assembly instructions from the micro kernel.
- When `stride>1`, pads the kernel asymmetrically to slightly reduce the size of the padded tensor.

However, in the interest of getting a PR merged, I did not implement a few other optimizations. The most important one is that this schedule is not autotunable in any meaningful way (besides reordering a few loops). In an ideal world, we would use custom knobs to allow reordering of the instructions inside `QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP` (e.g. do we load the kernel from memory first, or perform halfword packs on our input tensor first?). This would improve M4 performance a little, but I suspect it would improve M7 performance a lot.

Additionally, I would have liked to handle the edges of the convolution with strip mining instead of by padding the input tensor. Padding requires copying the entire tensor and is therefore slow, but TVM's support for strip mining is pretty bad.
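To make the packed-load and `__SMLAD` idea above concrete, here is a minimal Python sketch that emulates it. It is not the micro-kernel from this PR (which is hand-written assembly with its own kernel packing). The helper names (`smlad`, `pack_halfwords`, `load_word`, `quad_channel_two_taps`) are hypothetical, and the sketch assumes a channel-minor layout in which the four `int8` channel values for one kernel tap sit in consecutive bytes. It also ignores the 32-bit wraparound of the real accumulator.

```python
import struct

def smlad(x, y, acc):
    """Emulate SMLAD: acc + x.lo*y.lo + x.hi*y.hi on signed 16-bit halves.
    Note: real hardware wraps the result to 32 bits; this sketch does not."""
    x_lo, x_hi = struct.unpack("<hh", struct.pack("<I", x & 0xFFFFFFFF))
    y_lo, y_hi = struct.unpack("<hh", struct.pack("<I", y & 0xFFFFFFFF))
    return acc + x_lo * y_lo + x_hi * y_hi

def pack_halfwords(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word (lo in bits 0..15)."""
    return struct.unpack("<I", struct.pack("<hh", lo, hi))[0]

def load_word(buf, offset):
    """A single 32-bit load that fetches four consecutive int8 values."""
    return struct.unpack_from("<I", buf, offset)[0]

def unpack_int8x4(word):
    """Split a 32-bit word into four signed 8-bit lanes."""
    return struct.unpack("<4b", struct.pack("<I", word))

def quad_channel_two_taps(data, kernel, acc):
    """Accumulate two kernel taps for four channels.

    data and kernel each hold 8 int8 values: tap 0 for channels 0..3,
    then tap 1 for channels 0..3 (an assumed layout). Four word loads
    replace the sixteen byte loads a naive loop would issue.
    """
    d0 = unpack_int8x4(load_word(data, 0))
    d1 = unpack_int8x4(load_word(data, 4))
    k0 = unpack_int8x4(load_word(kernel, 0))
    k1 = unpack_int8x4(load_word(kernel, 4))
    # For each channel, pair the two taps as halfwords so that one SMLAD
    # performs both multiply-accumulates for that channel.
    return [
        smlad(pack_halfwords(d0[c], d1[c]), pack_halfwords(k0[c], k1[c]), acc[c])
        for c in range(4)
    ]

data = struct.pack("<8b", 1, -2, 3, -4, 5, 6, -7, 8)
kernel = struct.pack("<8b", 2, 3, -1, 4, -5, 1, 2, -3)
result = quad_channel_two_taps(data, kernel, [0, 0, 0, 0])
```

Because each SMLAD consumes a pair of taps, a kernel with an odd number of entries (such as 3x3) leaves one tap unpaired, which is why a specialized variant for that case appears among the desired improvements.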
A few other desired improvements:

- Custom knobs for reordering instructions in micro kernel
- Replace tensor padding with strip mining or something else
- Use a specialized version of `QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP` for the entry in kernels with an odd number of entries (e.g. 3x3 kernels)
- Generalize the micro kernel to support kernel sizes beyond 3x3
- Similar to the above, remove other restrictions on the use of this micro kernel (e.g. support kernel dilation)
- Allow requantization and ReLU instructions to be fused in a way that's not slow (next on my TODO list)

I'm marking this PR as a draft for now, since there currently aren't any tests.

Thanks to @areusch for his help with explaining how TVM does scheduling!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
