guberti commented on PR #13752: URL: https://github.com/apache/tvm/pull/13752#issuecomment-1402606269
## Next steps

The `137 ms` performance for the `vww` model is impressive, and beats the current state of the art by a good margin. However, there is still a lot of room to improve our MLPerf Tiny performance even further:

- Measure performance for the `ic` and `kws` MLPerf Tiny models. The changes in this pull request should dramatically improve performance on these as well, and we should be able to use them with only minor tinkering. I'm currently working on a follow-up PR to add this functionality.
- Add support for **autotuning** to my tensordot schedules. Specifically, tuning `num_outputs` should give a substantial performance improvement with very little work. If we're willing to be a little more ambitious, we could use tuning to reorder the assembly code in the generated `tensordot` functions (this would especially help Cortex-M7 performance).
- Add a second, word-unaligned copy of convolution kernels when it would help, and add support for this to `tensordot.py`.
- Skip padding steps by folding padding into previous operators (this should be enabled by Relax).
- See if we can use floor instead of rounding in `tensordot.py`'s requantization implementation. This should shave a couple of `ms` off the runtimes of `ic`, `kws`, and `vww`, but it might hurt accuracy slightly.
- Write a Cortex-M schedule for `qnn_dense`. This will improve performance for `ic`, `kws`, and `vww` by a tiny amount, but it will dramatically improve `ad` performance (which is currently still poor).
- Generalize `tensordot.py` to support Cortex-M CPUs _without_ the DSP extension. This would let us deliver good performance on Cortex-M0, M0+, M1, and M3 devices (this PR only improves performance for M4 and M7).
- Fix the bug with Arduino Cortex-M performance. Currently, this bug makes the Arduino implementation comically slow.

_Note: adding proper Helium support would require rewriting our `tensordot` implementation, as well as our legalization and alter_op passes. Helium is very cool, but proper support would take a lot of effort._

## Generalization of changes

Some of the `legalization` and `alter_op` changes would be broadly useful across TVM, but are currently only enabled for Arm Cortex-M. This includes our output layout rewriting for `conv -> depthwise` convolution patterns, our stripping of empty channels from the `conv2d` operator, and our splitting of `pad` into a separate Relay operator (the last one only helps _in some cases_). However, I would want to write more general passes before doing this, and I'm not sure how these would interact with Relax. I'll hold off on this for now.
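For readers unfamiliar with what the generated `tensordot` functions compute, here is a rough scalar model in plain Python. This is not the actual generated code (which is hand-scheduled C with inline assembly), just a sketch of why the DSP extension matters: each loop iteration maps to one SMLAD instruction, which does two 16-bit multiply-accumulates per cycle on a Cortex-M4/M7.

```python
def smlad(x_pair, y_pair, acc):
    # Scalar model of the Arm SMLAD instruction: two signed 16-bit
    # multiplies accumulated into one 32-bit register in one cycle.
    return acc + x_pair[0] * y_pair[0] + x_pair[1] * y_pair[1]

def tensordot_model(data, kernel):
    # Walk both operands two values at a time, so each iteration
    # corresponds to a single SMLAD on a DSP-capable core. Cores
    # without the DSP extension (M0/M0+/M1/M3) need four scalar
    # instructions for the same work, which is why generalizing
    # tensordot.py to them is a separate effort.
    acc = 0
    for i in range(0, len(data) - 1, 2):
        acc = smlad(data[i:i + 2], kernel[i:i + 2], acc)
    return acc
```

Tuning `num_outputs` amounts to choosing how many of these accumulator chains run interleaved per generated function, trading register pressure against pipeline utilization.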
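To make the floor-vs-rounding trade-off concrete, here is a minimal sketch of fixed-point requantization. The function name and signature are hypothetical, not the real `tensordot.py` code; the point is that round-to-nearest needs one extra bias-add per output value, which floor skips at a small cost in accuracy.

```python
def requantize(acc, scale_mul, shift, zero_point, use_floor=False):
    # Hypothetical helper: scale a 32-bit convolution accumulator
    # back to int8 using a fixed-point multiply and right shift.
    prod = acc * scale_mul
    if not use_floor:
        # Round-to-nearest costs one extra add per output element;
        # floor mode drops it, saving a few ms across a whole model.
        prod += 1 << (shift - 1)
    out = (prod >> shift) + zero_point
    return max(-128, min(127, out))  # clamp to the int8 range
```

For example, an accumulator of 101 with `scale_mul=3, shift=2` requantizes to 76 with rounding but 75 with floor, illustrating the small per-element error that floor mode can introduce.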
