comaniac commented on a change in pull request #8544:
URL: https://github.com/apache/tvm/pull/8544#discussion_r677036456
##########
File path: include/tvm/tir/schedule/schedule.h
##########
@@ -242,6 +242,26 @@ class ScheduleNode : public runtime::Object {
/******** Schedule: loop binding/annotation ********/
/******** Schedule: cache read/write ********/
/******** Schedule: reduction ********/
+ /*!
+ * \brief Factorize an associative reduction block by the specified loop.
+ * \details An associative reduction cannot be parallelized directly,
+ * because it leads to potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+ RFactor is a schedule primitive that implements the transformation described above.
+
+
Review comment:
```suggestion
 * because it leads to a potential race condition during accumulation.
 * Alternatively, the reduction could be factorized on a loop with the following steps:
 * - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
 * - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
 * - Step 3: accumulate the `n` separate buffers into the result buffer.
 * Note that Step 2 above introduces opportunities for parallelization.
 * RFactor is a schedule primitive that implements the transformation described above.
 *
 *
```
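The three steps described in this hunk can be sanity-checked with a small numpy sketch (illustrative only; the array and names such as `chunks` and `partial` are not part of the PR):

```python
import numpy as np

# A toy associative reduction: total = sum(a), factorized over n chunks.
a = np.arange(12, dtype=np.float64)
n = 4  # the loop extent of the factorized loop

# Step 1: evenly slice the reduction into n separate chunks.
chunks = a.reshape(n, -1)
# Step 2: compute the chunks separately into n intermediate buffers
# (each row's partial sum could be computed in parallel).
partial = chunks.sum(axis=1)
# Step 3: accumulate the n separate buffers into the result buffer.
total = partial.sum()

assert total == a.sum()
```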
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it leads to a potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ Given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
Review comment:
remove the redundant blank line
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it leads to a potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ Given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop
+     for j in range(128):  # loop j is a reduction loop
+         for k in range(128):  # loop k is a reduction loop
+             B[i] = B[i] + A[i, j, k]
+
+
+ Suppose RFactor is applied on the innermost loop `k` and `factor_axis = 1`.
+ RFactor then creates an intermediate buffer and two blocks.
+
+ - The intermediate buffer, or "rf-buffer", is a buffer of rank `ndim(B) + 1` and
+ size `size(B) * n`, whose shape expands from `shape(B)` by adding an axis of `n`
+ at the position specified by `factor_axis`. For example,
+
+     * shape(B) = [1, 2, 3], factor_axis = 0 => shape(B_rf) = [n, 1, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 1 => shape(B_rf) = [1, n, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 2 => shape(B_rf) = [1, 2, n, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 3 => shape(B_rf) = [1, 2, 3, n]
+
+ - The rfactor block, or "rf-block", is a block that writes to the `rf-buffer` without
+ accumulating over the loop `k`, i.e. the loop `k` is converted from a reduction loop
+ to a data parallel loop. In our example, the rf-block is:
+
+
+ .. code-block:: python
+
+ B_rf = np.zeros((128, 128))  # the rf-buffer
+ for k in range(128):  # loop k is converted to a data parallel loop
+     for i in range(128):  # loop i is a data parallel loop (unchanged)
+         for j in range(128):  # loop j is a reduction loop (unchanged)
+             B_rf[i, k] = B_rf[i, k] + A[i, j, k]
+
+
+ - The write-back block, or `wb-block`, is a block that accumulates the rf-buffer into
+ the result buffer. All the reduction loops are removed except the loop `k` for accumulation.
+ In our example, the wb-block is:
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop (unchanged)
+     # loop j is removed because it is a reduction loop
+     for k in range(128):  # loop k is a reduction loop (unchanged)
+         B[i] = B[i] + B_rf[i, k]
+
+ Parameters
+ ----------
+ loop : LoopRV
+     The loop outside the block for which we want to do rfactor
+ factor_axis : int
+     The position where the new dimension is placed in the newly introduced rfactor buffer
+
+ Returns
+ -------
+ rf_block : BlockRV
+     The block which computes partial results over each slice (i.e., the first block
+     as described in the above illustration)
+
+ Examples
+ --------
+
+ Before rfactor, in TensorIR, the IR is:
+
+ .. code-block:: python
+
+ @tvm.script.tir
+ def before_rfactor(a: ty.handle, b: ty.handle) -> None:
+ A = tir.match_buffer(a, (128, 128, 128), "float32")
+ B = tir.match_buffer(b, (128,), "float32")
+ with tir.block([128, tir.reduce_axis(0, 128), tir.reduce_axis(0, 128)], "B") as [vii, vi, vj]:
+ with tir.init():
+ B[vii] = 0.0
+ B[vii] = B[vii] + A[vii, vi, vj]
+
+ Create the schedule and do rfactor:
+
+ .. code-block:: python
+
+ sch = tir.Schedule(before_rfactor)
+ _, _, k = sch.get_loops(sch.get_block("B"))
+ sch.rfactor(k, 0)
+ print(tvm.script.asscript(sch.mod["main"]))
+
+ After applying rfactor, the IR becomes:
+
+ .. code-block:: python
+
+ @tvm.script.tir
+ def after_rfactor(a: ty.handle, b: ty.handle) -> None:
+ A = tir.match_buffer(a, [128, 128, 128])
+ B = tir.match_buffer(b, [128])
+ B_rf = tir.alloc_buffer([128, 128])
+ with tir.block([128, 128, tir.reduce_axis(0, 128)], "B_rf") as [vi2, vii, vi]:
+     with tir.init():
+         B_rf[vi2, vii] = 0.0
+     B_rf[vi2, vii] = (B_rf[vi2, vii] + A[vii, vi, vi2])
+ with tir.block([128, tir.reduce_axis(0, 128)], "B") as [vii_1, vi2_1]:
+     with tir.init():
+         B[vii_1] = 0.0
+     B[vii_1] = (B[vii_1] + B_rf[vi2_1, vii_1])
+
+
+ Note
+ ----
+
+ Rfactor requires:
+ 1) `loop` has only one child block, and it is a reduction block;
+ 2) `loop` is a reduction loop, i.e. the loop variable is bound to only reduction
+ variables in the block binding;
+ 3) `loop` is not parallelized, vectorized, unrolled or bound to any thread axis;
+ 4) The block scope that `loop` is in is a staged-pipeline;
+ 5) The outermost loop outside the reduction block should have the reduction block
+ as its first child block;
+ 6) The outermost reduction loop should have only one child block;
+ 7) A unary extent loop that is not bound to any reduction or data parallel variables
+ in the block binding should not appear under some reduction loop;
+ 8) The reduction block should write to only one buffer, and its init and body block only is
Review comment:
I didn't understand this sentence: "and its init and body block only is...". What does "only" mean here?
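As a plain-numpy illustration of the docstring's running example (names `B_rf` and `B` follow the docstring; the reduced extents of 8 instead of 128 are an assumption for brevity), the rf-block plus wb-block pair computes the same result as the direct reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 8, 8))  # stand-in for the docstring's 128x128x128 input

# rf-block: loop k becomes data parallel; only loop j is still reduced.
B_rf = np.zeros((8, 8))  # rf-buffer, laid out as [i, k]
for k in range(8):
    for i in range(8):
        for j in range(8):
            B_rf[i, k] += A[i, j, k]

# wb-block: accumulate the rf-buffer over k into the result buffer.
B = np.zeros(8)
for i in range(8):
    for k in range(8):
        B[i] += B_rf[i, k]

# Same result as the unfactorized reduction B[i] = sum(A[i, :, :]).
assert np.allclose(B, A.sum(axis=(1, 2)))
```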
##########
File path: tests/python/unittest/test_tir_schedule_reduction.py
##########
@@ -0,0 +1,675 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# pylint: disable=missing-function-docstring,missing-module-docstring
+import pytest
+
+import numpy as np
+import tvm
+import tvm.testing
+from tvm import tir
+from tvm.script import ty
+
+# pylint: disable=no-member,invalid-name,unused-variable
Review comment:
Merge to L17 or move to L18 if it's too long
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it leads to a potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ Given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop
+     for j in range(128):  # loop j is a reduction loop
+         for k in range(128):  # loop k is a reduction loop
+             B[i] = B[i] + A[i, j, k]
+
+
+ Suppose RFactor is applied on the innermost loop `k` and `factor_axis = 1`.
+ RFactor then creates an intermediate buffer and two blocks.
+
+ - The intermediate buffer, or "rf-buffer", is a buffer of rank `ndim(B) + 1` and
+ size `size(B) * n`, whose shape expands from `shape(B)` by adding an axis of `n`
+ at the position specified by `factor_axis`. For example,
+
+     * shape(B) = [1, 2, 3], factor_axis = 0 => shape(B_rf) = [n, 1, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 1 => shape(B_rf) = [1, n, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 2 => shape(B_rf) = [1, 2, n, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 3 => shape(B_rf) = [1, 2, 3, n]
+
+ - The rfactor block, or "rf-block", is a block that writes to the `rf-buffer` without
+ accumulating over the loop `k`, i.e. the loop `k` is converted from a reduction loop
+ to a data parallel loop. In our example, the rf-block is:
+
+
+ .. code-block:: python
+
+ B_rf = np.zeros((128, 128))  # the rf-buffer
+ for k in range(128):  # loop k is converted to a data parallel loop
+     for i in range(128):  # loop i is a data parallel loop (unchanged)
+         for j in range(128):  # loop j is a reduction loop (unchanged)
+             B_rf[i, k] = B_rf[i, k] + A[i, j, k]
+
+
Review comment:
ditto
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it leads to a potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ Given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop
+     for j in range(128):  # loop j is a reduction loop
+         for k in range(128):  # loop k is a reduction loop
+             B[i] = B[i] + A[i, j, k]
+
+
+ Suppose RFactor is applied on the innermost loop `k` and `factor_axis = 1`.
+ RFactor then creates an intermediate buffer and two blocks.
+
+ - The intermediate buffer, or "rf-buffer", is a buffer of rank `ndim(B) + 1` and
+ size `size(B) * n`, whose shape expands from `shape(B)` by adding an axis of `n`
+ at the position specified by `factor_axis`. For example,
+
+     * shape(B) = [1, 2, 3], factor_axis = 0 => shape(B_rf) = [n, 1, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 1 => shape(B_rf) = [1, n, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 2 => shape(B_rf) = [1, 2, n, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 3 => shape(B_rf) = [1, 2, 3, n]
+
+ - The rfactor block, or "rf-block", is a block that writes to the `rf-buffer` without
+ accumulating over the loop `k`, i.e. the loop `k` is converted from a reduction loop
+ to a data parallel loop. In our example, the rf-block is:
+
+
Review comment:
ditto
##########
File path: include/tvm/tir/stmt.h
##########
@@ -1361,6 +1362,24 @@ TVM_DLL PrimExpr TypeAnnotation(DataType dtype, Span span = Span());
// overload printing of for type.
TVM_DLL std::ostream& operator<<(std::ostream& os, ForKind kind);
+// inline implementations
+inline const char* ForKind2String(ForKind t) {
+ switch (t) {
+ case ForKind::kSerial:
+ return "serial";
+ case ForKind::kParallel:
+ return "parallel";
+ case ForKind::kVectorized:
+ return "vectorized";
+ case ForKind::kUnrolled:
+ return "unroll";
+ case ForKind::kThreadBinding:
+ return "thread_binding";
+ }
+ LOG(FATAL) << "Unknown ForKind";
Review comment:
```suggestion
LOG(FATAL) << "Unknown ForKind " << t;
```
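A hedged Python sketch of the same idea as the suggestion, i.e. including the offending value in the error message (this `ForKind` enum is a hypothetical mirror for illustration, not TVM's actual binding):

```python
from enum import Enum

class ForKind(Enum):
    """Hypothetical Python mirror of TIR's ForKind, for illustration only."""
    SERIAL = "serial"
    PARALLEL = "parallel"
    VECTORIZED = "vectorized"
    UNROLLED = "unroll"
    THREAD_BINDING = "thread_binding"

def for_kind_to_string(kind):
    # As the review suggests, report the offending value in the error message.
    if not isinstance(kind, ForKind):
        raise ValueError(f"Unknown ForKind {kind!r}")
    return kind.value
```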
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it leads to a potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ Given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop
+     for j in range(128):  # loop j is a reduction loop
+         for k in range(128):  # loop k is a reduction loop
+             B[i] = B[i] + A[i, j, k]
+
+
Review comment:
ditto
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it leads to a potential race condition during accumulation.
+ Alternatively, the reduction could be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the result into `n` intermediate buffers;
+ - Step 3: accumulate the `n` separate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ Given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop
+     for j in range(128):  # loop j is a reduction loop
+         for k in range(128):  # loop k is a reduction loop
+             B[i] = B[i] + A[i, j, k]
+
+
+ Suppose RFactor is applied on the innermost loop `k` and `factor_axis = 1`.
+ RFactor then creates an intermediate buffer and two blocks.
+
+ - The intermediate buffer, or "rf-buffer", is a buffer of rank `ndim(B) + 1` and
+ size `size(B) * n`, whose shape expands from `shape(B)` by adding an axis of `n`
+ at the position specified by `factor_axis`. For example,
+
+     * shape(B) = [1, 2, 3], factor_axis = 0 => shape(B_rf) = [n, 1, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 1 => shape(B_rf) = [1, n, 2, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 2 => shape(B_rf) = [1, 2, n, 3]
+     * shape(B) = [1, 2, 3], factor_axis = 3 => shape(B_rf) = [1, 2, 3, n]
+
+ - The rfactor block, or "rf-block", is a block that writes to the `rf-buffer` without
+ accumulating over the loop `k`, i.e. the loop `k` is converted from a reduction loop
+ to a data parallel loop. In our example, the rf-block is:
+
+
+ .. code-block:: python
+
+ B_rf = np.zeros((128, 128))  # the rf-buffer
+ for k in range(128):  # loop k is converted to a data parallel loop
+     for i in range(128):  # loop i is a data parallel loop (unchanged)
+         for j in range(128):  # loop j is a reduction loop (unchanged)
+             B_rf[i, k] = B_rf[i, k] + A[i, j, k]
+
+
+ - The write-back block, or `wb-block`, is a block that accumulates the rf-buffer into
+ the result buffer. All the reduction loops are removed except the loop `k` for accumulation.
+ In our example, the wb-block is:
+
+ .. code-block:: python
+
+ for i in range(128):  # loop i is a data parallel loop (unchanged)
+     # loop j is removed because it is a reduction loop
+     for k in range(128):  # loop k is a reduction loop (unchanged)
+         B[i] = B[i] + B_rf[i, k]
+
+ Parameters
+ ----------
+ loop : LoopRV
+     The loop outside the block for which we want to do rfactor
+ factor_axis : int
+     The position where the new dimension is placed in the newly introduced rfactor buffer
+
+ Returns
+ -------
+ rf_block : BlockRV
+     The block which computes partial results over each slice (i.e., the first block
+     as described in the above illustration)
+
+ Examples
+ --------
+
+ Before rfactor, in TensorIR, the IR is:
+
+ .. code-block:: python
+
+ @tvm.script.tir
+ def before_rfactor(a: ty.handle, b: ty.handle) -> None:
+ A = tir.match_buffer(a, (128, 128, 128), "float32")
+ B = tir.match_buffer(b, (128,), "float32")
+ with tir.block([128, tir.reduce_axis(0, 128), tir.reduce_axis(0, 128)], "B") as [vii, vi, vj]:
+ with tir.init():
+ B[vii] = 0.0
+ B[vii] = B[vii] + A[vii, vi, vj]
+
+ Create the schedule and do rfactor:
+
+ .. code-block:: python
+
+ sch = tir.Schedule(before_rfactor)
+ _, _, k = sch.get_loops(sch.get_block("B"))
+ sch.rfactor(k, 0)
+ print(tvm.script.asscript(sch.mod["main"]))
+
+ After applying rfactor, the IR becomes:
+
+ .. code-block:: python
+
+ @tvm.script.tir
+ def after_rfactor(a: ty.handle, b: ty.handle) -> None:
+ A = tir.match_buffer(a, [128, 128, 128])
+ B = tir.match_buffer(b, [128])
Review comment:
Add float32 to be consistent with the "before"?
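The `factor_axis` shape rule quoted in the hunks above can be checked with a short sketch (`rf_buffer_shape` is an illustrative helper, not part of the PR):

```python
def rf_buffer_shape(shape_b, factor_axis, n):
    """Insert an axis of extent n at position factor_axis, per the docstring."""
    return shape_b[:factor_axis] + (n,) + shape_b[factor_axis:]

# The four cases listed in the docstring, with shape(B) = [1, 2, 3]:
assert rf_buffer_shape((1, 2, 3), 0, 4) == (4, 1, 2, 3)
assert rf_buffer_shape((1, 2, 3), 1, 4) == (1, 4, 2, 3)
assert rf_buffer_shape((1, 2, 3), 2, 4) == (1, 2, 4, 3)
assert rf_buffer_shape((1, 2, 3), 3, 4) == (1, 2, 3, 4)
```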
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]