cconvey commented on code in PR #12169: URL: https://github.com/apache/tvm/pull/12169#discussion_r938212761
########## python/tvm/topi/hexagon/slice_ops/max_pool2d.py: ##########
@@ -0,0 +1,193 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=invalid-name, unused-variable, unused-argument, too-many-locals
+
+""" Compute and schedule for max_pool2d slice op
+
+Please note the following assumptions made by the implementation:
+
+1) The input must be padded in advance to account for 'padding'. In addition,
+   both input and output must be padded as per the physical buffer layout.
+2) 'padding' is ignored. It must be handled outside of the sliced op.
+3) Please note that this implementation will not work if the output includes
+   any physical-layout-related padding, as it can result in out-of-bounds
+   accesses to the input.
+"""
+
+from tvm import te
+from tvm import tir
+
+from ..utils import get_layout_transform_fn
+
+
+def validate_out_shape(out_shape, in_shape, kernel, stride, dilation):
+    """Validate output shape"""
+    _, oh, ow, _ = out_shape
+    _, ih, iw, _ = in_shape
+    kh, kw = kernel
+    sh, sw = stride
+    dh, dw = dilation
+    if ih < (oh - 1) * sh + dh * (kh - 1) + 1:
+        raise RuntimeError("Output height is too large")
+    if iw < (ow - 1) * sw + dw * (kw - 1) + 1:
+        raise RuntimeError("Output width is too large")
+
+
+def max_pool2d_compute(A, out_shape, kernel, stride, dilation):
+    """max_pool2d compute"""
+    kh, kw = kernel
+    rh = te.reduce_axis((0, kh), name="rh")
+    rw = te.reduce_axis((0, kw), name="rw")
+    ob, oh, ow, oc = out_shape
+    if isinstance(ob, int):
+        validate_out_shape(out_shape, A.shape, kernel, stride, dilation)
+
+    sh, sw = stride
+    dh, dw = dilation
+
+    Max = te.compute(
+        out_shape,
+        lambda b, h, w, c: te.max(
+            A[b, h * sh + dh * rh, w * sw + dw * rw, c].astype(A.dtype), axis=[rh, rw]
+        ),
+        name="max",
+    )
+    return Max
+
+
+def STIR_schedule_nhwc_8h2w32c2w(outs, ins, output_layout: str, input_layout: str):
+    """Schedule for input and output layout nhwc-8h2w32c2w"""
+    func = te.create_prim_func([ins, outs])
+    s = tir.Schedule(func)
+
+    # NOTE!!! This scheduling logic is a work in progress.
+    # It is not known to ultimately result in near-optimal Hexagon performance.
+    # The schedule below strives to implement these heuristics:
+    #
+    # - Each 2048-byte chunk of the output tensor should be visited only once,
+    #   if possible.
+    #
+    # - The resulting object code should use Hexagon v69's HVX SIMD units if at
+    #   all possible.  (The HVX SIMD registers are 2048 bytes long, so each
+    #   "chunk" of the output tensor fits exactly in a single HVX SIMD register.)
+
+    Max = s.get_block("max")
+
+    input_transform_fn = get_layout_transform_fn(input_layout)
+    output_transform_fn = get_layout_transform_fn(output_layout)
+
+    s.transform_layout(Max, ("read", 0), input_transform_fn)
+    s.transform_layout(Max, ("write", 0), output_transform_fn)
+
+    # Restructure the loop nestings to have this overall structure:
+    # (loop over different 2048-byte output-tensor chunks) : n, ho, wo, co     }- the first level of a two-level tensor layout
+    # (loop within one 2048-byte output-tensor chunk)      : hi, wio, ci, wii  }- the second level of a two-level tensor layout
+    # (loop over reduction axes)                           : rh, rw            }- loop over multiple elements of the input tensor
+    #
+    # Note: This schedule is a work in progress. We *expect* that it's
+    # crucially important for the loops to have this relative ordering:
+    #   n ... ho ... wo ... co ... hi ... wio ... ci ... wii
+    # because it lets us visit each of the 2048-byte output chunks precisely once.
+
+    (
+        n,
+        h,
+        w,
+        c,
+        rh,
+        rw,
+    ) = s.get_loops(Max)
+
+    # Restructure the loops from NHWC to nhwc-8h2w32c2w, with loops for 'max's
+    # reduction axes at the very end.
+    ho, hi = s.split(h, [None, 8])
+    wo, wi = s.split(w, [None, 4])
+    wio, wii = s.split(wi, [None, 2])
+    co, ci = s.split(c, [None, 32])
+    s.reorder(n, ho, wo, co, hi, wio, ci, wii, rh, rw)

Review Comment:
   From talking with @Lunderberg, I got the impression that some or all of these loop manipulations were expected to already be done by the earlier call to `transform_layout`.
   
   TODO(cconvey): Figure out whether my understanding was wrong, or whether there's a bug in `transform_layout`.
   
   EDIT: Just remembered that this code is dealing with S-TIR scheduling, not TE scheduling. I.e., we call `tir.Schedule(...)` rather than `te.create_schedule(...)`. We currently expect the automatic loop reordering to happen when we call `transform_layout` on TE schedules, but to _not_ happen when we call `transform_layout` on _S-TIR_ schedules. So this code is actually working as expected.
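   [Editor's note] As a sanity check on the compute definition above, the `te.compute` lambda can be modelled in plain Python. This is an illustrative sketch only; the name `max_pool2d_ref` and the test data are invented here and are not part of the PR:

```python
def max_pool2d_ref(a, out_shape, kernel, stride, dilation):
    """Pure-Python model of the te.compute lambda above: for each output
    element, take the max over a kh x kw window, indexing the input as
    A[b, h*sh + dh*rh, w*sw + dw*rw, c]."""
    ob, oh, ow, oc = out_shape
    kh, kw = kernel
    sh, sw = stride
    dh, dw = dilation
    out = [[[[None] * oc for _ in range(ow)] for _ in range(oh)] for _ in range(ob)]
    for b in range(ob):
        for h in range(oh):
            for w in range(ow):
                for c in range(oc):
                    out[b][h][w][c] = max(
                        a[b][h * sh + dh * rh][w * sw + dw * rw][c]
                        for rh in range(kh)
                        for rw in range(kw)
                    )
    return out


# 1x4x4x1 input, 2x2 kernel, stride 2, dilation 1 -> 1x2x2x1 output.
a = [[[[v] for v in row] for row in [[1, 2, 3, 4],
                                     [5, 6, 7, 8],
                                     [9, 10, 11, 12],
                                     [13, 14, 15, 16]]]]
result = max_pool2d_ref(a, (1, 2, 2, 1), (2, 2), (2, 2), (1, 1))
# Each 2x2 window's max: 6, 8, 14, 16.
```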
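   [Editor's note] The splits and reorder in the schedule imply a fixed index mapping from logical NHWC coordinates to the two-level nhwc-8h2w32c2w layout. The following pure-Python sketch makes that mapping explicit; the helper name `nhwc_8h2w32c2w_index` is invented for illustration and the 2-byte dtype is an assumption (it makes one inner chunk exactly 2048 bytes, matching the schedule's comments):

```python
def nhwc_8h2w32c2w_index(n, h, w, c):
    """Map a logical NHWC index to the two-level tiled layout implied by
    the splits above: h -> (ho, hi=8), w -> (wo, wio=2, wii=2), c -> (co, ci=32)."""
    return (
        n,             # n
        h // 8,        # ho
        w // 4,        # wo
        c // 32,       # co
        h % 8,         # hi
        (w % 4) // 2,  # wio
        c % 32,        # ci
        w % 2,         # wii
    )


# One (n, ho, wo, co) chunk covers 8 * 4 * 32 = 1024 elements; with a
# 2-byte dtype (e.g. float16) that is the 2048-byte chunk size the
# schedule's heuristics aim to visit exactly once.
inner = {nhwc_8h2w32c2w_index(0, h, w, c)[4:]
         for h in range(8) for w in range(4) for c in range(32)}
assert len(inner) == 1024
```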
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
