guberti commented on PR #12856: URL: https://github.com/apache/tvm/pull/12856#issuecomment-1259494443
### Why doesn't `SMLAD` improve performance?

Recall that the SMLAD instruction takes two packed int16x2 values x1::x2 and y1::y2 and an accumulator z, and computes z += x1 * y1 + x2 * y2. For NHWC layouts, however, the relevant x1::x2 values in the input tensor are not adjacent in memory. Previously, we used a DSP-specific halfword packing instruction __PKHBT to fix this, and then called __SMLAD - two instructions for two multiplies.

However, there is a lesser-known non-DSP instruction SMLAxy that is present on all Cortex-M cores (see [docs](https://developer.arm.com/documentation/dui0068/b/ARM-Instruction-Reference/ARM-multiply-instructions/SMLAxy)). This instruction lets us read just one 16-bit half of an int32 register while performing a multiply-accumulate, allowing us to skip the PKHBT instruction. Doing the multiplies this way is just as fast, while being far more versatile and simpler.

This means it is _impossible_ to use `SMLAD` to improve performance for input tensors in `NHWC` format - nowhere in the input tensor is the relevant data packed correctly, so at least one extra instruction is always needed to rearrange it. That extra instruction erases any benefit `SMLAD` has over the non-DSP `SMLAxy`.

Note that for the `NCHW` format, `SMLAD` would be very helpful. We should look into changing the layout in the Relay graph, as this would yield a major performance improvement.
