manupa-arm commented on issue #8717:
URL: https://github.com/apache/tvm/issues/8717#issuecomment-914445825
Thanks for the investigation; the suggestions seem to improve things a bit.
1.) Would it be possible to share the TIR PrimFunc that is the source for
this code generation?
2.) In the first example, it seems we are unnecessarily materializing an
int32 feature map just to cast it in another set of loops -- even with the
int16 cast.
```
for (int32_t ax0_ax1_fused_ax2_fused_ax3_fused = 0;
     ax0_ax1_fused_ax2_fused_ax3_fused < 401408;
     ++ax0_ax1_fused_ax2_fused_ax3_fused) {
  int32_t _1 = (int32_t)(((((0 != 0)
      ? (((int64_t)(((int32_t*)DepthwiseConv2d)[(ax0_ax1_fused_ax2_fused_ax3_fused)] +
            ((int32_t*)placeholder2)[((ax0_ax1_fused_ax2_fused_ax3_fused & 127))])) <<
          ((int64_t)0))
      : ((int64_t)(((int32_t*)DepthwiseConv2d)[(ax0_ax1_fused_ax2_fused_ax3_fused)] +
            ((int32_t*)placeholder2)[((ax0_ax1_fused_ax2_fused_ax3_fused & 127))]))) *
      (int64_t)2080045879) + ((int64_t)1 << ((int64_t)((4 + 31) - 1)))) >>
      ((int64_t)(4 + 31)));
  int32_t _2 = (_1) < (255) ? (_1) : (255);
  ((int16_t*)T_cast)[(ax0_ax1_fused_ax2_fused_ax3_fused)] =
      ((int16_t)((uint8_t)((_2) > (0) ? (_2) : (0))));
}
```
Do we know why the HW feature dictates that the whole feature map needs to be
int16 (as opposed to casting on the fly, i.e. fused)? Especially in this case,
the int16 tensor bleeds out of the fused primitive Relay operator and is not
used until the next operator.
My question is why the consuming operator can't produce the int16 value right
at the accumulator, without needing a feature-map-wide int16 tensor.
cc: @anijain2305
3.) In the second example, can we get the fusion done as well?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]