guberti commented on PR #12969: URL: https://github.com/apache/tvm/pull/12969#issuecomment-1265401605
In #12856, we discussed how `NHWC` is a bad layout for `depthwise_conv2d` in microTVM (and likewise `NCHW` a bad layout for regular `conv2d`). From this, one might ask:

> Given a `conv2d` on Cortex-M4 with `n` groups, what are the optimal data and kernel layouts?

When choosing these layouts, we _really_ want values that will be multiplied and accumulated together to be adjacent in memory. The primary reason is that this lets us use the `__SMLAD` instruction with minimal overhead, which performs two multiply-accumulates in a single instruction. The secondary reason is that it lets us read both the input data and the kernel with `*ptr++` as much as possible, since `*ptr++` compiles to one instruction on Cortex-M.

In a depthwise convolution, channels do not interact with each other at all, so there is no reason for them to be adjacent in memory. This applies to both the input data and the kernel. Hence, `NCHW` is the optimal data layout and `OIHW` the optimal kernel layout. By similar reasoning, `NHWC` and `OHWI` are optimal for regular Conv2D operators. We can generalize further: for a grouped Conv2D with `n` groups and `c` channels, the optimal layouts are `NCHWxc`/`OIHWxi`, where `x = c / n`.

Now, suppose we are performing a grouped Conv2D with `n` groups and `c` channels, using data layout `NCHWxc` and kernel layout `OIHWxi`. In the `int16` case, to convolve one entire row (`width * channels / groups` individual values), all we need to do is repeat this code `width * channels / (2 * groups)` times (the `2` coming from the fact that two `int16` values fit into one `int32`):

```c
uint32_t tensor_batch = *tensor++;  // load two packed int16 input values
uint32_t kernel_batch = *kernel++;  // load two packed int16 kernel values
sum = __SMLAD(tensor_batch, kernel_batch, sum);  // two multiply-accumulates
```

This code **does not** depend on the number of groups, which lets us use the same tensorize function for both regular and depthwise convolutions!
