ekalda commented on code in PR #14003:
URL: https://github.com/apache/tvm/pull/14003#discussion_r1113319104


##########
python/tvm/topi/arm_cpu/conv2d_spatial_pack.py:
##########
@@ -424,33 +441,22 @@ def schedule_conv2d_spatial_pack_nhwc(cfg, s, op, output):
         max_unroll=16,
         cfg=cfg,
     )
-    cfg["ann_spatial"].apply(
-        s, conv, [ohi, owi, oci], axis_lens=[OHI, OWI, OCI], max_unroll=16, cfg=cfg
-    )
-    if cfg["compat"].val < 2:
-        compat_axis = [owo, oco][cfg["compat"].val]  # pylint: disable=R1706
-        s[kernel_vec].compute_at(s[conv], compat_axis)
-        s[data_vec].compute_at(s[conv], compat_axis)
-
-    if not autotvm.GLOBAL_SCOPE.in_tuning:
-        # schedule kernel pack
-        oco, kh, kw, ic, oci = kernel_vec.op.axis
-        s[kernel_vec].vectorize(oci)
-        s[kernel_vec].unroll(ic)
-        if cfg["compat"].val == 2:
-            s[kernel_vec].parallel(oco)
-
-    # schedule data pack
+    cfg["ann_spatial"].apply(s, conv, [owi, oci], axis_lens=[OWI, OCI], max_unroll=16, cfg=cfg)
+
+    # schedule data_vec, data_pad and kernel_vec
+    compat_axis = [owo, oco][cfg["compat"].val]  # pylint: disable=R1706
+    s[kernel_vec].compute_at(s[conv], compat_axis)
+    s[data_vec].compute_at(s[conv], compat_axis)
+
+    # Inlining kernel vec brings a performance improvement, but the tuner seems to not
+    # like it, so inline only when we are using the fallback config
+    if cfg.is_fallback:
+        s[kernel_vec].compute_inline()

Review Comment:
   It is essentially about whether we create an additional intermediate buffer to store the reorganised weights data. This is how it looks when we don't inline:
   ```
                   for ax2_outer in range(16):
                       for i2, i3 in T.grid(8, 72):
                           cse_var_1: T.int32 = i2 * 72
                           PadInput_1 = T.Buffer((576,), data=PadInput)
                           p0_1 = T.Buffer((1179648,), data=p0)
                            PadInput_1[cse_var_1 + i3] = p0_1[ax0_ax1_outer_fused * 9216 + ax2_outer * 576 + cse_var_1 + i3]
                       kernel_vec_1 = T.Buffer((1728,), data=kernel_vec)
                       for oco, ic, oci in T.grid(3, 72, 8):
                           p1_1 = T.Buffer((1728,), data=p1)
                            kernel_vec_1[oco * 576 + ic * 8 + oci] = p1_1[ic * 24 + oco * 8 + oci]
                       conv_1 = T.Buffer((24,), "float32x8", data=conv)
                       for oco in range(3):
                           for owi_init in range(8):
                                conv_1[oco * 8 + owi_init] = T.Broadcast(T.float32(0), 8)
                           for ic, owi in T.grid(72, 8):
                               cse_var_2: T.int32 = oco * 8 + owi
                               PadInput_1 = T.Buffer((576,), data=PadInput)
                                conv_1[cse_var_2] = T.call_llvm_pure_intrin("float32x8", T.uint32(141), T.uint32(3), T.Broadcast(PadInput_1[owi * 72 + ic], 8), kernel_vec_1[oco * 576 + ic * 8:oco * 576 + ic * 8 + 8], conv_1[cse_var_2])
   ```
   and this is what happens when we do:
   ```
               for ax2_outer in range(16):
                   for i2, i3 in T.grid(8, 72):
                       cse_var_1: T.int32 = i2 * 72
                       PadInput_1 = T.Buffer((576,), data=PadInput)
                       p0_1 = T.Buffer((1179648,), data=p0)
                        PadInput_1[cse_var_1 + i3] = p0_1[ax0_ax1_outer_fused * 9216 + ax2_outer * 576 + cse_var_1 + i3]
                   conv_1 = T.Buffer((24,), "float32x8", data=conv)
                   for oco in range(3):
                       for owi_init in range(8):
                            conv_1[oco * 8 + owi_init] = T.Broadcast(T.float32(0), 8)
                       for ic, owi in T.grid(72, 8):
                           cse_var_3: T.int32 = oco * 8
                           cse_var_2: T.int32 = cse_var_3 + owi
                           PadInput_1 = T.Buffer((576,), data=PadInput)
                           p1_1 = T.Buffer((1728,), data=p1)
                            conv_1[cse_var_2] = T.call_llvm_pure_intrin("float32x8", T.uint32(141), T.uint32(3), T.Broadcast(PadInput_1[owi * 72 + ic], 8), p1_1[ic * 24 + cse_var_3:ic * 24 + cse_var_3 + 8], conv_1[cse_var_2])
   ```
   The first one has a loop that fills the `kernel_vec_1` buffer, which is then read when computing the `conv_1` values, while in the second case `kernel_vec_1` is never created and the weights data (in the `p1_1` input variable) is accessed directly.
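   Outside TVM, the difference can be sketched in plain Python. The shapes and layouts below are borrowed from the TIR dumps above (IC=72, OCO=3 output-channel tiles of OCI=8 lanes); the values are made up for illustration:

```python
# Shapes mirroring the TIR dumps: 72 input channels, 3 x 8 output channels.
IC, OCO, OCI, OWI = 72, 3, 8, 8

# weights laid out like p1: weights[ic][oco * OCI + oci]
weights = [[float(ic * 24 + j) for j in range(OCO * OCI)] for ic in range(IC)]
# data laid out like PadInput: data[owi][ic]
data = [[float(owi * 72 + ic) for ic in range(IC)] for owi in range(OWI)]

# Without inlining: first materialise the packed intermediate buffer
# (kernel_vec_1 in the first dump), then read from it in the compute loop.
kernel_vec = [[[weights[ic][oco * OCI + oci] for oci in range(OCI)]
               for ic in range(IC)] for oco in range(OCO)]
conv_packed = [[sum(data[owi][ic] * kernel_vec[oco][ic][oci] for ic in range(IC))
                for oci in range(OCI)]
               for oco in range(OCO) for owi in range(OWI)]

# With inlining: no intermediate buffer; index the weights (p1_1) directly.
conv_inline = [[sum(data[owi][ic] * weights[ic][oco * OCI + oci] for ic in range(IC))
                for oci in range(OCI)]
               for oco in range(OCO) for owi in range(OWI)]

assert conv_packed == conv_inline  # same results, one fewer buffer
```

Both variants compute identical values; inlining just trades the extra packing loop and buffer for recomputed indexing into the original weights.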



##########
python/tvm/topi/arm_cpu/conv2d_spatial_pack.py:
##########
@@ -302,9 +303,27 @@ def conv2d_spatial_pack_nhwc(cfg, data, kernel, strides, padding, dilation, out_
     )
 
     cfg.define_annotate("ann_reduce", [kh, kw], policy="try_unroll")
-    cfg.define_annotate("ann_spatial", [ohi, owi, oci], policy="try_unroll_vec")
+    cfg.define_annotate("ann_spatial", [owi, oci], policy="try_unroll_vec")
     # ====================================================================
 
+    # If there are no tuning records, use this config
+    if cfg.is_fallback:
+
+        def _tile_size(axis, candidates):
+            for candidate in candidates:
+                tiles_divisible_by_candidate = axis % candidate == 0
+                if tiles_divisible_by_candidate:
+                    return candidate
+            return 1
+
+        cfg["tile_oh"] = SplitEntity([-1, 1])
+        cfg["tile_ow"] = SplitEntity([-1, _tile_size(OW, [8, 4])])
+        cfg["tile_co"] = SplitEntity([-1, _tile_size(OC, [8, 4])])

Review Comment:
   Added a comment :) 
   
   There is no perfect config that works for all conv2d shapes and all CPUs (that's why we tune), but chunks of 4 or 8 elements of data conveniently fit into vectors (the LLVM vectors of 8 elements get broken down into two vector instructions). That applies to float32 data, though; I didn't analyse the int8 performance of these schedules. However, from what I can see, this is the only floating-point conv2d NHWC schedule and the default one for this layout, whilst for int8 we use other schedules. I don't claim it's the best possible config, but it worked well on a range of common models I looked at, and it is easy to change and experiment with.



##########
python/tvm/topi/arm_cpu/conv2d_spatial_pack.py:
##########
@@ -402,20 +421,18 @@ def schedule_conv2d_spatial_pack_nhwc(cfg, s, op, output):
     oho, ohi = cfg["tile_oh"].apply(s, output, oh)
     owo, owi = cfg["tile_ow"].apply(s, output, ow)
     s[output].reorder(n, oho, owo, oco, ohi, owi, oci)
-    cfg["ann_spatial"].apply(
-        s, output, [ohi, owi, oci], axis_lens=[OHI, OWI, OCI], max_unroll=16, cfg=cfg
-    )
-    cfg.define_knob("compat", [0, 1, 2])
-    if cfg["compat"].val < 2:
-        compat_axis = [owo, oco][cfg["compat"].val]  # pylint: disable=R1706
-        s[conv].compute_at(s[output], compat_axis)
+    cfg["ann_spatial"].apply(s, output, [owi, oci], axis_lens=[OWI, OCI], max_unroll=16, cfg=cfg)

Review Comment:
   Yes, I haven't looked at the other schedules in detail, but I assume there are opportunities to limit the search space in favour of fewer failed attempts during tuning. From eyeballing the tuning results, it seemed like unrolling/vectorizing across an outer axis never succeeded unless the tile values of the corresponding inner axis were 1 and optimised out, which essentially corresponds to not tiling. These kinds of configs were not particularly performant, though.
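   The observation about inner tiles of 1 can be sketched with a toy model; `split_iter` below is a hypothetical stand-in for an axis split, not a TVM API:

```python
def split_iter(extent, factor):
    # Model splitting a loop of `extent` iterations into (outer, inner)
    # index pairs with inner tile size `factor`.
    for outer in range((extent + factor - 1) // factor):
        for inner in range(min(factor, extent - outer * factor)):
            yield outer, inner

# With an inner tile of 1 the inner axis always runs a single iteration,
# so the split is optimised away -- effectively no tiling at all.
assert all(inner == 0 for _, inner in split_iter(8, 1))
# With a genuine tile (factor 4), the inner axis does real work.
assert max(inner for _, inner in split_iter(8, 4)) == 3
```

In that degenerate case, annotating the outer axis is really annotating the whole (untiled) loop, which matches why those configs succeeded but didn't perform well.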



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to