guberti commented on PR #13752: URL: https://github.com/apache/tvm/pull/13752#issuecomment-1402606269
## Next steps

The `137 ms` performance for the `vww` model is impressive, and beats the current state of the art by a good margin. However, there is still a lot of room to improve our MLPerf Tiny performance even further:

- Measure performance for the `ic` and `kws` MLPerf Tiny models. The changes in this pull request should dramatically improve performance on these as well, and we should be able to use them with only minor tinkering. I'm currently working on a follow-up PR to add this functionality.
- Add support for **autotuning** to my tensordot schedules. Specifically, tuning `num_outputs` should give a substantial performance improvement with very little work. If we're willing to be a little more ambitious, we could use tuning to reorder the assembly code in the generated `tensordot` functions (this would especially help Cortex-M7 performance).
- Add a second, word-unaligned copy of convolution kernels when it would help, and add support for this to `tensordot.py`.
- Skip padding steps by folding padding into previous operators (this should be enabled by Relax).
- See if we can use floor instead of rounding in `tensordot.py`'s requantization implementation. This should shave a couple of `ms` off the runtimes of `ic`, `kws`, and `vww`, but it might hurt accuracy slightly.
- Write a Cortex-M schedule for `qnn_dense`. This will improve performance for `ic`, `kws`, and `vww` by a tiny amount, but it will dramatically improve `ad` performance (which is currently still poor).
- Generalize `tensordot.py` to support Cortex-M CPUs _without_ the DSP extension. This would let us deliver good performance on Cortex-M0, M0+, M1, and M3 devices (this PR only improves performance for M4 and M7).
- Fix the bug with Arduino Cortex-M performance. Currently, this bug makes the Arduino implementation comically slow.

_Note: adding proper Helium support would require rewriting our `tensordot` implementation, as well as our legalization and alter_op passes. Helium is very cool, but proper support would take a lot of effort._

## Generalization of changes

Some of the `legalization` and `alter_op` changes would be broadly useful across TVM, but are currently only enabled for Arm Cortex-M. This includes our output layout rewriting for `conv -> depthwise` convolution patterns, our stripping of empty channels from the `conv2d` operator, and our splitting of `pad` into a separate Relay operator (the last one only helps _in some cases_). However, I would want to write more general passes before doing this, and I'm not sure how these would interact with Relax. I'll hold off on this for now.
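For readers unfamiliar with what the generated `tensordot` functions compute, here is a rough scalar model in plain Python. This is not the actual generated code (which is hand-scheduled C with inline assembly), just a sketch of why the DSP extension matters: each loop iteration maps to one SMLAD instruction, which does two 16-bit multiply-accumulates per cycle on a Cortex-M4/M7.

```python
def smlad(x_pair, y_pair, acc):
    # Scalar model of the Arm SMLAD instruction: two signed 16-bit
    # multiplies accumulated into one 32-bit register in one cycle.
    return acc + x_pair[0] * y_pair[0] + x_pair[1] * y_pair[1]

def tensordot_model(data, kernel):
    # Walk both operands two values at a time, so each iteration
    # corresponds to a single SMLAD on a DSP-capable core. Cores
    # without the DSP extension (M0/M0+/M1/M3) need four scalar
    # instructions for the same work, which is why generalizing
    # tensordot.py to them is a separate effort.
    acc = 0
    for i in range(0, len(data) - 1, 2):
        acc = smlad(data[i:i + 2], kernel[i:i + 2], acc)
    return acc
```

Tuning `num_outputs` amounts to choosing how many of these accumulator chains run interleaved per generated function, trading register pressure against pipeline utilization.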
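To make the floor-vs-rounding trade-off concrete, here is a minimal sketch of fixed-point requantization. The function name and signature are hypothetical, not the real `tensordot.py` code; the point is that round-to-nearest needs one extra bias-add per output value, which floor skips at a small cost in accuracy.

```python
def requantize(acc, scale_mul, shift, zero_point, use_floor=False):
    # Hypothetical helper: scale a 32-bit convolution accumulator
    # back to int8 using a fixed-point multiply and right shift.
    prod = acc * scale_mul
    if not use_floor:
        # Round-to-nearest costs one extra add per output element;
        # floor mode drops it, saving a few ms across a whole model.
        prod += 1 << (shift - 1)
    out = (prod >> shift) + zero_point
    return max(-128, min(127, out))  # clamp to the int8 range
```

For example, an accumulator of 101 with `scale_mul=3, shift=2` requantizes to 76 with rounding but 75 with floor, illustrating the small per-element error that floor mode can introduce.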
