Anndrey24 commented on code in PR #16899:
URL: https://github.com/apache/tvm/pull/16899#discussion_r1572419116


##########
python/tvm/topi/arm_cpu/conv2d_gemm.py:
##########
@@ -478,23 +498,21 @@ def schedule_conv2d_gemm_native(cfg, s, out, final_out):
         s[C].tensorize(y_inner, gemm_acc)
         s[C].parallel(x_outer)
     else:
-        k_outer, k_inner = s[C].split(k, 4)
-        x_outer, y_outer, x_inner, y_inner = s[C].tile(x, y, x_factor=4, y_factor=y_tile_size)
-        y_inner_outer, y_inner_inner = s[C].split(y_inner, nparts=4)

Review Comment:
   It changes the `llvm.fmuladd` intrinsic that gets generated in the micro-kernel:
   
   - before: 16 x `tail call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %113, <4 x float> %115, <4 x float> %68)`
   - after: 4 x `tail call <16 x float> @llvm.fmuladd.v16f32(<16 x float> %62, <16 x float> %64, <16 x float> %38)`
   
   In both cases it lowers to an assembly micro-kernel that contains 64 FMLA instructions and the same number of register loads, just in a different order. I've tried both versions on around 15 models that contain conv2d operators, and the speedup only ranged from 0.99 to 1.01, so I'm fairly sure the change has no performance impact, positive or negative.
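
As a quick sanity check on the counts quoted above, the sketch below multiplies the number of intrinsic calls by the vector width in each case. The helper name `total_f32_lanes` is hypothetical and only illustrates the arithmetic; it says nothing about how the backend schedules the resulting instructions.

```python
# Lane-count check for the two intrinsic shapes quoted in the comment.
# The (calls, lanes) pairs are taken from the comment; the helper is hypothetical.

def total_f32_lanes(calls: int, lanes_per_call: int) -> int:
    """Scalar fp32 multiply-adds covered by `calls` vector fmuladd intrinsics."""
    return calls * lanes_per_call

before = total_f32_lanes(calls=16, lanes_per_call=4)   # 16 x llvm.fmuladd.v4f32
after = total_f32_lanes(calls=4, lanes_per_call=16)    # 4 x llvm.fmuladd.v16f32

print(before, after)
```

Both groups cover the same number of fp32 lanes, which is consistent with the observation that only the shape of the IR changes, not the amount of arithmetic.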
   
   My reasoning behind the change was to unify the scalable and non-scalable scheduling for fp16/fp32 under a single `if` branch (which doesn't use `te.tile()`), while also making the schedule slightly easier to understand and reducing the size of the TIR / LLVM IR (fewer unrolled instructions).
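
For readers less familiar with tiling, here is a minimal pure-Python sketch (not TVM code) of what a two-dimensional tile does to a loop nest: it visits the same index points as the plain nest, in a different order. The 8x8 extents and the tile factors of 4 are assumptions chosen for illustration; the real `y_tile_size` is not stated in the comment.

```python
from itertools import product

def plain_order(x_extent, y_extent):
    """Visit order of an untiled two-level loop nest."""
    return [(x, y) for x in range(x_extent) for y in range(y_extent)]

def tiled_order(x_extent, y_extent, x_factor, y_factor):
    """Visit order after tiling: (x_outer, y_outer, x_inner, y_inner)."""
    # Assumption for this sketch: extents divide evenly by the tile factors.
    assert x_extent % x_factor == 0 and y_extent % y_factor == 0
    order = []
    for xo, yo in product(range(x_extent // x_factor), range(y_extent // y_factor)):
        for xi, yi in product(range(x_factor), range(y_factor)):
            order.append((xo * x_factor + xi, yo * y_factor + yi))
    return order

plain = plain_order(8, 8)
tiled = tiled_order(8, 8, x_factor=4, y_factor=4)

same_points = sorted(plain) == sorted(tiled)  # identical set of iterations
same_order = plain == tiled                   # but a different traversal order
print(same_points, same_order)
```

Dropping the tile therefore changes how the iteration space is walked and unrolled, not how much work is done.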



