guberti commented on PR #12856:
URL: https://github.com/apache/tvm/pull/12856#issuecomment-1259494443

   ### Why doesn't `SMLAD` improve performance?
   
   Recall that the `SMLAD` instruction takes two packed int16 pairs x1::x2 and y1::y2 and an accumulator z, and computes z += x1 * y1 + x2 * y2. For `NHWC` layouts, however, the relevant x1::x2 values in the input tensor are not adjacent in memory. Previously, we used the DSP-specific halfword packing instruction `__PKHBT` to pack them first, and only then called `__SMLAD` - two instructions for two multiplies.
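   To make the cost of this sequence concrete, here is a portable C sketch of the semantics of the two instructions. The function names (`emulated_pkhbt`, `emulated_smlad`) are illustrative stand-ins, not the actual CMSIS intrinsics, and the bit manipulation mirrors what the hardware does in a single cycle each:

```c
#include <stdint.h>

/* PKHBT: pack the bottom halfword of a with the (shifted) top halfword
   of b into one 32-bit register. */
static uint32_t emulated_pkhbt(uint32_t a, uint32_t b, unsigned shift)
{
    return (a & 0x0000FFFFu) | ((b << shift) & 0xFFFF0000u);
}

/* SMLAD: z += x1*y1 + x2*y2, where x = x2::x1 and y = y2::y1 are
   signed 16-bit halves of 32-bit registers. */
static int32_t emulated_smlad(uint32_t x, uint32_t y, int32_t z)
{
    int16_t x1 = (int16_t)(x & 0xFFFFu), x2 = (int16_t)(x >> 16);
    int16_t y1 = (int16_t)(y & 0xFFFFu), y2 = (int16_t)(y >> 16);
    return z + x1 * y1 + x2 * y2;
}
```

   With non-adjacent inputs, every two multiply-accumulates cost a pack plus a multiply: `emulated_smlad(emulated_pkhbt(a, b, 16), y, z)`.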
   
   However, there is a lesser-known non-DSP instruction `SMLAxy` that is present on all Cortex-M cores (see 
[docs](https://developer.arm.com/documentation/dui0068/b/ARM-Instruction-Reference/ARM-multiply-instructions/SMLAxy)). It reads only one 16-bit half of each 32-bit register while performing a multiply-accumulate, which lets us skip the `__PKHBT` instruction entirely. Doing the multiplies this way is just as fast, while being simpler and far more versatile.
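   A portable sketch of the `SMLAxy` semantics, with the `x`/`y` suffixes selecting which halfword of each operand is used (again, illustrative names rather than real intrinsics):

```c
#include <stdint.h>

/* SMLABB: multiply the bottom halves of x and y, accumulate into z. */
static int32_t emulated_smlabb(uint32_t x, uint32_t y, int32_t z)
{
    return z + (int16_t)(x & 0xFFFFu) * (int16_t)(y & 0xFFFFu);
}

/* SMLATB: multiply the top half of x by the bottom half of y,
   accumulate into z. */
static int32_t emulated_smlatb(uint32_t x, uint32_t y, int32_t z)
{
    return z + (int16_t)(x >> 16) * (int16_t)(y & 0xFFFFu);
}
```

   Two `SMLAxy` issues replace the `__PKHBT` + `__SMLAD` pair - the same two-instructions-per-two-multiplies cost, with no requirement that the operands be packed side by side first.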
   
   This means it is _impossible_ to use `SMLAD` to improve performance for input tensors in `NHWC` format - nowhere in the input tensor is the relevant data packed correctly, so at least one extra instruction is always needed to fix this. That extra instruction already cancels out any benefit of `SMLAD` over the non-DSP `SMLAxy`.
   
   Note that for the `NCHW` format, `SMLAD` would be very helpful. We should look into changing the layout in the Relay graph, as this could yield a major performance improvement.
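   The layout difference comes down to which axis is innermost in memory. A small sketch of the standard flat-offset formulas (dimension names are the usual N, H, W, C; this is an illustration, not TVM's indexing code):

```c
#include <stddef.h>

/* NHWC: channels are the innermost (fastest-varying) axis. */
static size_t nhwc_offset(size_t n, size_t h, size_t w, size_t c,
                          size_t H, size_t W, size_t C)
{
    return ((n * H + h) * W + w) * C + c;
}

/* NCHW: width is the innermost axis. */
static size_t nchw_offset(size_t n, size_t c, size_t h, size_t w,
                          size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}
```

   In `NCHW`, neighbors along the width axis sit in adjacent int16 slots, so a single 32-bit load already yields a valid x1::x2 pair for `SMLAD`; in `NHWC`, stepping along width jumps by C elements, which is exactly why the pairs are never packed correctly.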

