FranklandJack commented on code in PR #14519:
URL: https://github.com/apache/tvm/pull/14519#discussion_r1159934614


##########
python/tvm/topi/arm_cpu/depthwise_conv2d.py:
##########
@@ -394,7 +394,8 @@ def schedule_conv_out(out):
             ci_outer, ci_inner = s[out].split(ci, 4)
             s[out].vectorize(ci_inner)
             s[out].unroll(ci_outer)
-
+        else:
+            s[out].vectorize(ci)

Review Comment:
   This is a good point. I'm not sure what would happen if `ci` were huge; I 
guess LLVM would legalize the vectors, looping over the hardware vector 
register width when storing 32-bit floats.
   
   I sort of just copied this from 
[above](https://github.com/apache/tvm/blob/main/python/tvm/topi/arm_cpu/depthwise_conv2d.py#L369) 
though, reusing the same scheduling that we use for the actual convolution.
   
   Since the fallback schedule splits the loop with `ci = 8`, and for the 
hardware in question that is the number of 32-bit floats per vector register, I 
think we should be okay. In the case where we aren't using the fallback, i.e. 
auto-scheduling, I would hope the auto-scheduler finds a choice for `ci` that 
is at least as good as `8`, but perhaps there are scenarios where it could pick 
a bad value?
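   For illustration, here is a plain-NumPy sketch (not TVM code) of what the split + vectorize transform amounts to; `VEC_WIDTH` and `scale_channels` are hypothetical names, and the per-element operation is just a stand-in:

```python
import numpy as np

VEC_WIDTH = 8  # assumption: 32-bit floats per vector register on the target


def scale_channels(data, factor=2.0):
    """Sketch: split the channel axis `ci` into (ci_outer, ci_inner) with
    ci_inner == VEC_WIDTH, then replace the inner loop with a single vector
    operation (the "vectorize" step)."""
    ci = data.shape[0]
    # The fallback schedule's ci = 8 makes this split exact on the target.
    assert ci % VEC_WIDTH == 0
    out = np.empty_like(data)
    for ci_outer in range(ci // VEC_WIDTH):
        lo = ci_outer * VEC_WIDTH
        # Inner loop over ci_inner collapsed into one vectorized slice op.
        out[lo:lo + VEC_WIDTH] = data[lo:lo + VEC_WIDTH] * factor
    return out
```

   If `ci` were much larger than the register width, LLVM would be expected to do an analogous legalization of the single long vector op into a loop over register-sized chunks.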
   
   We could do something similar to what we do for regular convolutions and 
define a [tunable parameter for 
vectorization](https://github.com/apache/tvm/blob/main/python/tvm/topi/arm_cpu/conv2d_spatial_pack.py#L324), 
but that would potentially change the scheduling of the depthwise convolution 
itself rather than of its output, which is somewhat beyond the initial scope of 
this PR.
   
   What are your thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
