ibsidorenko opened a new pull request, #12659:
URL: https://github.com/apache/tvm/pull/12659

   This commit adds a high-performance implementation of the `fixed_point_multiply`
   operation based on Hexagon intrinsics for the vmpye/vmpyo instructions.
   
   Benchmarking of the `fixed_point_multiply` op with a (1,8,56,56,32) input
   tensor on Qualcomm SM8350:
     * default implementation: **10.06 ms**
     * optimized implementation: **1.42 ms**
     * speedup: **~7x**
   
   Please note that this introduces a small rounding error for some
   corner cases with a negative shift argument (the same as for ARM CPU, see
   [PR#5980](https://github.com/apache/tvm/pull/5980)). This is because we
   round twice instead of only once:
     * original q_multiply_shift: round(x*y*2^-s)
     * hexagon q_multiply_shift: round(round(x*y)*2^-s)
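
The difference between the two formulas can be illustrated with a small pure-Python model. This is only a sketch, not the TVM or Hexagon code: the helper names are hypothetical, `multiplier` is assumed to be a Q31 fixed-point value (so `y = multiplier / 2**31`), `s` is assumed to be the right-shift amount (a negative shift argument), and rounding is assumed to be half-up.

```python
def round_shift(value: int, shift: int) -> int:
    """Arithmetic right shift by `shift` bits with half-up rounding (hypothetical helper)."""
    if shift == 0:
        return value
    return (value + (1 << (shift - 1))) >> shift


def fpm_single_round(x: int, multiplier: int, s: int) -> int:
    """Model of round(x*y*2^-s): one rounding step over the full product."""
    return round_shift(x * multiplier, 31 + s)


def fpm_double_round(x: int, multiplier: int, s: int) -> int:
    """Model of round(round(x*y)*2^-s): round the Q31 product, then round the shift."""
    return round_shift(round_shift(x * multiplier, 31), s)


half = 1 << 30  # Q31 encoding of y = 0.5 (assumed representation)

# x*y = 0.5, s = 1: the exact value is 0.25.
single = fpm_single_round(1, half, 1)  # rounds 0.25 once
double = fpm_double_round(1, half, 1)  # rounds 0.5 up to 1, then rounds 0.5 up again
print(single, double)
```

In this model the two results differ by one for `x = 1`, while most inputs (for example `x = 3` with the same multiplier and shift) give identical results, which matches the description of the discrepancy as a corner case.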
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
