cconvey commented on code in PR #12169: URL: https://github.com/apache/tvm/pull/12169#discussion_r938212761
########## python/tvm/topi/hexagon/slice_ops/max_pool2d.py: ##########
@@ -0,0 +1,193 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, unused-variable, unused-argument, too-many-locals
+
+""" Compute and schedule for max_pool2d slice op
+
+Please note the following assumptions made by the implementation:
+
+1) The input must be padded in advance to account for 'padding'. In addition,
+   both input and output must be padded as per the physical buffer layout.
+2) 'padding' is ignored. It must be handled outside of the sliced op.
+3) Please note that this implementation will not work if the output includes
+   any physical-layout-related padding, as it can result in out-of-bounds
+   accesses to the input.
+"""
+
+from tvm import te
+from tvm import tir
+
+from ..utils import get_layout_transform_fn
+
+
+def validate_out_shape(out_shape, in_shape, kernel, stride, dilation):
+    """Validate output shape"""
+    _, oh, ow, _ = out_shape
+    _, ih, iw, _ = in_shape
+    kh, kw = kernel
+    sh, sw = stride
+    dh, dw = dilation
+    if ih < (oh - 1) * sh + dh * (kh - 1) + 1:
+        raise RuntimeError("Output height is too large")
+    if iw < (ow - 1) * sw + dw * (kw - 1) + 1:
+        raise RuntimeError("Output width is too large")
+
+
+def max_pool2d_compute(A, out_shape, kernel, stride, dilation):
+    """max_pool2d compute"""
+    kh, kw = kernel
+    rh = te.reduce_axis((0, kh), name="rh")
+    rw = te.reduce_axis((0, kw), name="rw")
+    ob, oh, ow, oc = out_shape
+    if isinstance(ob, int):
+        validate_out_shape(out_shape, A.shape, kernel, stride, dilation)
+
+    sh, sw = stride
+    dh, dw = dilation
+
+    Max = te.compute(
+        out_shape,
+        lambda b, h, w, c: te.max(
+            A[b, h * sh + dh * rh, w * sw + dw * rw, c].astype(A.dtype), axis=[rh, rw]
+        ),
+        name="max",
+    )
+    return Max
+
+
+def STIR_schedule_nhwc_8h2w32c2w(outs, ins, output_layout: str, input_layout: str):
+    """Schedule for input and output layout nhwc-8h2w32c2w"""
+    func = te.create_prim_func([ins, outs])
+    s = tir.Schedule(func)
+
+    # NOTE!!! This scheduling logic is a work in progress.
+    # It is not known to ultimately result in near-optimal Hexagon performance.
+    # The schedule below strives to implement these heuristics:
+    #
+    # - Each 2048-byte chunk of the output tensor should be visited only once,
+    #   if possible.
+    #
+    # - The resulting object code should use Hexagon v69's HVX SIMD units if at
+    #   all possible.  (The HVX SIMD registers are 2048 bytes long, so each
+    #   "chunk" of the output tensor fits exactly in a single HVX SIMD register.)
+
+    Max = s.get_block("max")
+
+    input_transform_fn = get_layout_transform_fn(input_layout)
+    output_transform_fn = get_layout_transform_fn(output_layout)
+
+    s.transform_layout(Max, ("read", 0), input_transform_fn)
+    s.transform_layout(Max, ("write", 0), output_transform_fn)
+
+    # Restructure the loop nestings to have this overall structure:
+    # (loop over different 2048-byte output-tensor chunks) : n, ho, wo, co     }- the first level of a two-level tensor layout
+    # (loop within one 2048-byte output-tensor chunk)      : hi, wio, ci, wii  }- the second level of a two-level tensor layout
+    # (loop over reduction axes)                           : rh, rw            }- loop over multiple elements of the input tensor
+    #
+    # Note: This schedule is a work in progress. We *expect* that it's
+    # crucially important for the loops to have this relative ordering:
+    #   n ... ho ... wo ... co ... hi ... wio ... ci ... wii
+    # because it lets us visit each of the 2048-byte output chunks precisely once.
+
+    (
+        n,
+        h,
+        w,
+        c,
+        rh,
+        rw,
+    ) = s.get_loops(Max)
+
+    # Restructure the loops from NHWC to nhwc-8h2w32c2w, with loops for 'max's
+    # reduction axes at the very end.
+    ho, hi = s.split(h, [None, 8])
+    wo, wi = s.split(w, [None, 4])
+    wio, wii = s.split(wi, [None, 2])
+    co, ci = s.split(c, [None, 32])
+    s.reorder(n, ho, wo, co, hi, wio, ci, wii, rh, rw)

Review Comment:
   From talking with @Lunderberg, I got the impression that some or all of these loop manipulations were expected to already be done by the earlier call to `transform_layout`.
   
   TODO(cconvey): Figure out whether my understanding was wrong, or whether there's a bug in `transform_layout`.
   
   EDIT: Just remembered that this code is dealing with S-TIR scheduling, not TE scheduling. I.e., we call `tir.Schedule(...)` rather than `te.create_schedule(...)`. We currently expect the automatic loop reordering to happen when we call `transform_layout` on TE schedules, but to _not_ happen when we call `transform_layout` on _S-TIR_ schedules. So this code is actually working as expected.
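   [Editor's note] As a sanity check on the compute definition above, the `te.compute` lambda can be modelled in plain Python. This is an illustrative sketch only; the name `max_pool2d_ref` and the test data are invented here and are not part of the PR:

```python
def max_pool2d_ref(a, out_shape, kernel, stride, dilation):
    """Pure-Python model of the te.compute lambda above: for each output
    element, take the max over a kh x kw window, indexing the input as
    A[b, h*sh + dh*rh, w*sw + dw*rw, c]."""
    ob, oh, ow, oc = out_shape
    kh, kw = kernel
    sh, sw = stride
    dh, dw = dilation
    out = [[[[None] * oc for _ in range(ow)] for _ in range(oh)] for _ in range(ob)]
    for b in range(ob):
        for h in range(oh):
            for w in range(ow):
                for c in range(oc):
                    out[b][h][w][c] = max(
                        a[b][h * sh + dh * rh][w * sw + dw * rw][c]
                        for rh in range(kh)
                        for rw in range(kw)
                    )
    return out


# 1x4x4x1 input, 2x2 kernel, stride 2, dilation 1 -> 1x2x2x1 output.
a = [[[[v] for v in row] for row in [[1, 2, 3, 4],
                                     [5, 6, 7, 8],
                                     [9, 10, 11, 12],
                                     [13, 14, 15, 16]]]]
result = max_pool2d_ref(a, (1, 2, 2, 1), (2, 2), (2, 2), (1, 1))
# Each 2x2 window's max: 6, 8, 14, 16.
```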
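   [Editor's note] The splits and reorder in the schedule imply a fixed index mapping from logical NHWC coordinates to the two-level nhwc-8h2w32c2w layout. The following pure-Python sketch makes that mapping explicit; the helper name `nhwc_8h2w32c2w_index` is invented for illustration and the 2-byte dtype is an assumption (it makes one inner chunk exactly 2048 bytes, matching the schedule's comments):

```python
def nhwc_8h2w32c2w_index(n, h, w, c):
    """Map a logical NHWC index to the two-level tiled layout implied by
    the splits above: h -> (ho, hi=8), w -> (wo, wio=2, wii=2), c -> (co, ci=32)."""
    return (
        n,             # n
        h // 8,        # ho
        w // 4,        # wo
        c // 32,       # co
        h % 8,         # hi
        (w % 4) // 2,  # wio
        c % 32,        # ci
        w % 2,         # wii
    )


# One (n, ho, wo, co) chunk covers 8 * 4 * 32 = 1024 elements; with a
# 2-byte dtype (e.g. float16) that is the 2048-byte chunk size the
# schedule's heuristics aim to visit exactly once.
inner = {nhwc_8h2w32c2w_index(0, h, w, c)[4:]
         for h in range(8) for w in range(4) for c in range(32)}
assert len(inner) == 1024
```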
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
