MasterJH5574 commented on a change in pull request #8544:
URL: https://github.com/apache/tvm/pull/8544#discussion_r677072101
##########
File path: python/tvm/tir/schedule/schedule.py
##########
@@ -512,6 +512,150 @@ def after_inline(a: ty.handle, c: ty.handle) -> None:
########## Schedule: loop binding/annotation ##########
########## Schedule: cache read/write ##########
########## Schedule: reduction ##########
+ def rfactor(self, loop: LoopRV, factor_axis: int) -> BlockRV:
+ """Factorize an associative reduction block by the specified loop.
+
+ An associative reduction cannot be parallelized directly,
+ because it would lead to potential race conditions during accumulation.
+ Alternatively, the reduction can be factorized on a loop with the following steps:
+ - Step 1: evenly slice the reduction into `n` separate chunks, where `n` is the loop extent;
+ - Step 2: compute the chunks separately and write the results into `n` intermediate buffers;
+ - Step 3: accumulate the `n` intermediate buffers into the result buffer.
+ Note that Step 2 above introduces opportunities for parallelization.
+
+ RFactor is a schedule primitive that implements the transformation described above:
+ given a block that writes to buffer `B`, it factorizes a loop of extent `n`.
+
+ For example, the pseudocode below accumulates `B[i] = sum(A[i, :, :])`:
+
+
+ .. code-block:: python
+
+     for i in range(128):                  # loop i is a data parallel loop
+         for j in range(128):              # loop j is a reduction loop
+             for k in range(128):          # loop k is a reduction loop
+                 B[i] = B[i] + A[i, j, k]
+
+
+ Suppose RFactor is applied on the innermost loop `k` with `factor_axis = 1`.
+ RFactor then creates an intermediate buffer and two blocks.
+
+ - The intermediate buffer, or "rf-buffer", is a buffer of rank `ndim(B) + 1` and
+ size `size(B) * n`, whose shape expands from `shape(B)` by adding an axis of length `n`
+ at the position specified by `factor_axis`. For example,
+
+   * shape(B) = [1, 2, 3], factor_axis = 0  =>  shape(B_rf) = [n, 1, 2, 3]
+   * shape(B) = [1, 2, 3], factor_axis = 1  =>  shape(B_rf) = [1, n, 2, 3]
+   * shape(B) = [1, 2, 3], factor_axis = 2  =>  shape(B_rf) = [1, 2, n, 3]
+   * shape(B) = [1, 2, 3], factor_axis = 3  =>  shape(B_rf) = [1, 2, 3, n]
+
+ - The rfactor block, or "rf-block", is a block that writes to the `rf-buffer` without
+ accumulating over the loop `k`, i.e. the loop `k` is converted from a reduction loop
+ to a data parallel loop. In our example, the rf-block is:
+
+
+ .. code-block:: python
+
+     B_rf = np.zeros((128, 128))           # the rf-buffer
+     for k in range(128):                  # loop k is converted to a data parallel loop
+         for i in range(128):              # loop i is a data parallel loop (unchanged)
+             for j in range(128):          # loop j is a reduction loop (unchanged)
+                 B_rf[i, k] = B_rf[i, k] + A[i, j, k]
+
+
+ - The write-back block, or `wb-block`, is a block that accumulates the rf-buffer into
+ the result buffer. All the reduction loops are removed except the loop `k` for accumulation.
+ In our example, the wb-block is:
+
+ .. code-block:: python
+
+     for i in range(128):                  # loop i is a data parallel loop (unchanged)
+                                           # loop j is removed because it is a reduction loop
+         for k in range(128):              # loop k is a reduction loop (unchanged)
+             B[i] = B[i] + B_rf[i, k]
+
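+ For illustration only, the combined effect of the rf-block and wb-block on the example
+ above can be sketched with NumPy (a numerical illustration, not code produced by the
+ schedule itself):
+
+ .. code-block:: python
+
+     import numpy as np
+
+     A = np.random.rand(128, 128, 128).astype("float32")
+     n = 128                  # extent of the factorized loop k
+     # rf-block: partial sums over j, one chunk per value of k;
+     # with factor_axis = 1 the new axis of length n sits at position 1 of the rf-buffer
+     B_rf = A.sum(axis=1)     # shape (128, n)
+     # wb-block: accumulate the n chunks into the result buffer
+     B = B_rf.sum(axis=1)     # shape (128,), i.e. the full reduction A.sum(axis=(1, 2))
+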
+ Parameters
+ ----------
+ loop : LoopRV
+     The loop outside the block for which we want to do rfactor
+ factor_axis : int
+     The position where the new dimension is placed in the newly introduced rfactor buffer
+
+ Returns
+ -------
+ rf_block : BlockRV
+     The block which computes partial results over each slice (i.e., the first block
+     as described in the illustration above)
+
+ Examples
+ --------
+
+ Before rfactor, in TensorIR, the IR is:
+
+ .. code-block:: python
+
+     @tvm.script.tir
+     def before_rfactor(a: ty.handle, b: ty.handle) -> None:
+         A = tir.match_buffer(a, (128, 128, 128), "float32")
+         B = tir.match_buffer(b, (128,), "float32")
+         with tir.block([128, tir.reduce_axis(0, 128),
+                         tir.reduce_axis(0, 128)], "B") as [vii, vi, vj]:
+             with tir.init():
+                 B[vii] = 0.0
+             B[vii] = B[vii] + A[vii, vi, vj]
+
+ Create the schedule and do rfactor:
+
+ .. code-block:: python
+
+ sch = tir.Schedule(before_rfactor)
+ _, _, k = sch.get_loops(sch.get_block("B"))
+ sch.rfactor(k, 0)
+ print(tvm.script.asscript(sch.mod["main"]))
+
+ After applying rfactor, the IR becomes:
+
+ .. code-block:: python
+
+     @tvm.script.tir
+     def after_rfactor(a: ty.handle, b: ty.handle) -> None:
+         A = tir.match_buffer(a, [128, 128, 128])
+         B = tir.match_buffer(b, [128])
Review comment:
Oh thanks! This line slipped through. Usually we just leave out the
type specification in `tir.match_buffer`, so I'll remove those in the "before".
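For reference, the simplified lines would look roughly like this (same shapes as above, just without the dtype strings):

    A = tir.match_buffer(a, (128, 128, 128))
    B = tir.match_buffer(b, (128,))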