ibsidorenko opened a new pull request, #13080:
URL: https://github.com/apache/tvm/pull/13080

   The main goal of this commit is to improve performance for the Hexagon target while preserving performance/accuracy for x86, GPU, and other targets.
   
   The "qnn.requantize" operation is lowered into a sequence of multiply, add, and shift operations during the QNN canonicalization pass when the scale quantization parameter is a vector of scalars. This commit adds a new per-channel/per-axis Relay FixedPointMultiply operation, which is used in the "qnn.requantize" lowering.
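   To illustrate the idea (this is a simplified sketch, not TVM's actual canonicalization code), a floating-point requantize scale can be decomposed into an integer multiplier and a shift, so the whole rescale becomes integer multiply/add/shift arithmetic:

   ```python
   import math

   def float_to_fixed(scale):
       # Decompose scale into a significand in [0.5, 1) and an exponent,
       # then encode the significand as a signed 31-bit fixed-point multiplier.
       mant, exp = math.frexp(scale)
       multiplier = round(mant * (1 << 31))
       if multiplier == (1 << 31):  # rounding pushed the significand to 1.0
           multiplier //= 2
           exp += 1
       return multiplier, exp

   def q_multiply_shift(x, multiplier, shift):
       # Rounding fixed-point multiply; Python ints are arbitrary precision,
       # so this mimics the 64-bit intermediate arithmetic.
       total_shift = 31 - shift
       round_const = 1 << (total_shift - 1)
       return (x * multiplier + round_const) >> total_shift
   ```

   For example, a scale of 0.25 becomes the pair (2^30, -1), so multiplying 100 by it yields 25 using only integer operations.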
   
   The per-channel/per-axis FixedPointMultiply is implemented through the tir.q_multiply_shift_per_axis intrinsic. For the Hexagon target it overrides the default implementation and generates HVX vmpye/vmpyo instructions (see _q_multiply_shift_per_axis_hexagon). All other targets use the default implementation (64-bit arithmetic).
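   The per-axis semantics can be sketched as follows (a hypothetical reference in Python, not the intrinsic's actual implementation): each channel gets its own (multiplier, shift) pair, and the rounding fixed-point multiply is applied along that axis with 64-bit-style intermediate arithmetic:

   ```python
   def q_multiply_shift_per_axis(data, multipliers, shifts):
       # data: list of per-channel rows of int values.
       # multipliers/shifts: one fixed-point (multiplier, shift) pair per channel.
       out = []
       for row, m, s in zip(data, multipliers, shifts):
           total_shift = 31 - s
           round_const = 1 << (total_shift - 1)
           out.append([(x * m + round_const) >> total_shift for x in row])
       return out
   ```

   With multiplier 2^30 this corresponds to per-channel scales of 0.5 (shift 0) and 0.25 (shift -1), so the rows are rescaled independently.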
   
   **Performance/accuracy measurement:**
   
   * _CPU (x86) target_: accuracy and performance are unchanged. The same should hold for other targets (otherwise it is a bug).
   
   * _Hexagon target_: qnn.requantize speeds up 7x-9x (Snapdragon 888, 4.4 ms -> 0.5 ms). Unfortunately, in some cases the requantize output can differ by "1" from the reference output. This tradeoff between performance and accuracy needs to be considered when choosing the optimized implementation.

