guberti opened a new pull request, #12856: URL: https://github.com/apache/tvm/pull/12856
For a while, I've intended to fix my depthwise_conv2d schedule so that its unique weight repacking scheme happens at compile time instead of during inference. While working on this, though, I discovered that the `SMLAD` instruction we use to compute multiplications in parallel does not actually save time. Recall that the `SMLAD` instruction takes two `int16*2` values `x1::x2` and `y1::y2` and an accumulator `z`, and computes `z += x1 * y1 + x2 * y2`. For `NHWC` layouts, however, the relevant `x1::x2` values in the input tensor are not adjacent in memory. Previously, we used a DSP-specific halfword packing instruction `__PKHBT` to fix this, and then called `__SMLAD` afterwards: two instructions for two multiplies. This is also what CMSIS-NN's [most optimized depthwise convolution code does](https://github.com/ARM-software/CMSIS_5/blob/dde5bac01b1b0b5ef528989a3139ce10bb1b054d/CMSIS/NN/Source/ConvolutionFunctions/arm_depthwise_conv_s8_opt.c#L319-L353).

However, there is a lesser-known non-DSP instruction `SMLAxy` that is present on **all** Cortex-M cores (see [docs](https://developer.arm.com/documentation/dui0068/b/ARM-Instruction-Reference/ARM-multiply-instructions/SMLAxy)). This instruction lets us read only one 16-bit half of an int32 register while performing a multiply-accumulate, allowing us to skip the `PKHBT` instruction. Doing the multiplies this way is just as fast, while being simpler and far more versatile.

This PR changes the Cortex-M depthwise convolution setup to use `SMLAxy` instead of `SMLAD`. It also removes the 3x3 kernel restriction and the complicated kernel packing mechanism. The net effect is that this schedule has become slightly faster (about 10% for 3x3 kernels) for kernels with an odd number of entries.

This is still a draft PR, as I need to resolve an issue where `topi.reshape` introduces redundant instructions.
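For readers unfamiliar with these instructions, here is a plain-C sketch of their semantics. These are illustrative emulations written for this description, not the real CMSIS intrinsics or a model of the hardware pipeline; `pack16x2` is a hypothetical helper for building the packed `int16*2` operands:

```c
#include <stdint.h>

/* Hypothetical helper: pack two int16 values into one 32-bit word,
 * with `lo` in the bottom halfword. */
static uint32_t pack16x2(int16_t lo, int16_t hi) {
    return (uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* SMLAD semantics: acc += x_lo * y_lo + x_hi * y_hi, i.e. two
 * int16 multiplies in one instruction -- but both halfword pairs
 * must already sit packed in the same registers. */
static int32_t smlad(uint32_t x, uint32_t y, int32_t acc) {
    int16_t x_lo = (int16_t)(x & 0xFFFF), x_hi = (int16_t)(x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFF), y_hi = (int16_t)(y >> 16);
    return acc + x_lo * y_lo + x_hi * y_hi;
}

/* SMLABB semantics: acc += (bottom half of x) * (bottom half of y). */
static int32_t smlabb(uint32_t x, uint32_t y, int32_t acc) {
    return acc + (int16_t)(x & 0xFFFF) * (int16_t)(y & 0xFFFF);
}

/* SMLATT semantics: acc += (top half of x) * (top half of y).
 * Reading just one half per operand means no repacking is needed
 * when the halfwords are not adjacent in memory. */
static int32_t smlatt(uint32_t x, uint32_t y, int32_t acc) {
    return acc + (int16_t)(x >> 16) * (int16_t)(y >> 16);
}
```

Two `SMLAxy` issues compute the same two-element dot product as a `PKHBT` + `SMLAD` pair, so the instruction count is the same, but without the requirement that the input halfwords first be packed side by side.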
