guberti commented on PR #12856: URL: https://github.com/apache/tvm/pull/12856#issuecomment-1259494443
### Why doesn't `SMLAD` improve performance?

Recall that the SMLAD instruction takes two packed int16x2 values x1::x2 and y1::y2 and an accumulator z, and computes z += x1 * y1 + x2 * y2. For NHWC layouts, however, the relevant x1::x2 values in the input tensor are not adjacent in memory. Previously, we used a DSP-specific halfword packing instruction __PKHBT to fix this, and then called __SMLAD - two instructions for two multiplies.

However, there is a lesser-known non-DSP instruction SMLAxy that is present on all Cortex-M cores (see [docs](https://developer.arm.com/documentation/dui0068/b/ARM-Instruction-Reference/ARM-multiply-instructions/SMLAxy)). This instruction lets us read just one 16-bit half of an int32 register while performing a multiply-accumulate, allowing us to skip the PKHBT instruction. Doing the multiplies this way is just as fast, while being far more versatile and simpler.

This means it is _impossible_ to use `SMLAD` to improve performance for input tensors in `NHWC` format - nowhere in the input tensor is the relevant data packed correctly, so at least one extra instruction is always needed to rearrange it. That extra instruction erases any benefit `SMLAD` has over the non-DSP `SMLAxy`.

Note that for the `NCHW` format, `SMLAD` would be very helpful. We should look into changing the layout in the Relay graph, as this would yield a major performance improvement.
