masahi commented on code in PR #80: URL: https://github.com/apache/tvm-rfcs/pull/80#discussion_r905556477
########## rfcs/0077-async-pipeline.md:
##########
@@ -0,0 +1,528 @@
- Feature Name: Asynchronous stage in software pipeline
- Authors: [Masahiro Masuda](https://github.com/masahi), [Wuwei Lin](https://github.com/vinx13/)
- Start Date: 2022-06-17

# Summary
This RFC proposes two TIR intrinsics and an additional annotation to the TIR software pipeline transform, to express asynchrony **within the device code**.
Asynchrony is prevalent on the host (runtime) side, and this proposal is the first step toward bringing the notion of an asynchronous operation into the generated code.

The most important component we should agree on is the model of synchronization: coming up with a design that is general enough to be useful for diverse backends, while making sure that the chosen design can be translated to the low-level synchronization model of a particular backend, is highly non-trivial.
The approach described in this document is motivated by a use case for NVIDIA GPUs, but we took some care so that the design can be adopted by other backends. For example, if a backend has an asynchronous DMA engine, a vector unit, and a tensor unit, we can specify that each of them runs asynchronously in a different stage of a pipeline, with the necessary synchronization between them.

The proposed model may diverge from conventional ones, but we believe it is a good fit for the TIR software pipeline specifically.

# Asynchronous stage in a software pipeline

### Background: What is a software pipeline, and what does the TIR software pipeline transform do?

Software pipelining is an optimization technique to improve the instruction-level parallelism of a loop. For example, given this program:

```python
B = alloc([1])

for i in range(16):
    B[0] = A[i] + 1
    C[i] = B[0] + 1
```

the goal is to overlap the execution of the two statements in the loop body, by letting them operate on different iterations of the loop.
This way, the second statement no longer depends on the completion of the first statement in the same iteration.

The TIR software pipeline transform enables such a transformation at the TIR level. We annotate the loop in the above program to specify, for each statement in the loop, its “stage” and its “order” in the pipeline:

```python
sch = ...
sch.annotate(i, "software_pipeline_stage", [0, 1])
sch.annotate(i, "software_pipeline_order", [0, 1])
```

Given the annotation above, the TIR software pipeline transform breaks the loop up into three parts: prologue, pipeline body, and epilogue. Different “stages” in the pipeline body become independent of each other, and the integer value of “stage” tells how many iterations each statement runs ahead of its consumer.

```python
B = alloc([2])

# Prologue
B[0] = A[0] + 1

# Body
for i in range(15):
    B[(i + 1) % 2] = A[i + 1] + 1
    C[i] = B[i % 2] + 1

# Epilogue
C[15] = B[1] + 1
```

The two statements in the body can potentially run in parallel, if the underlying HW supports out-of-order execution.

### Making parallelism more explicit: Asynchronous pipeline

What’s currently available is, after all, a “software” pipeline: whether or not independent statements actually run in parallel is up to the underlying HW, and programmers have little control over it. Moreover, for in-order processors like the Hexagon DSP, this transformation alone would not help.

The goal of this work is to support HW-backed asynchrony in the pipeline. Asynchronous data movement is becoming increasingly important in GPU computing, and NPUs typically have multiple kinds of asynchronous units (DMA copy, vector & matrix compute, etc.). To exploit such hardware features, it’s essential that we can express all available kinds of asynchrony in the IR.

A user of the TIR software pipeline transform will be able to specify, via an additional annotation, which data movement or compute block should become asynchronous.
For example, given the annotation specifying that the first block in the pipeline be made async,
```python
for i in range(16):
    B[0] = A[i] + 1
    C[i] = B[0] + 1

...
sch.annotate(i, "software_pipeline_stage", [0, 1])
...

# "0" refers to the first element in the list [0, 1] above, i.e. the first block
sch.annotate(i, "software_pipeline_async_stages", [0])
```

we generate the following IR. An asynchronous block is decorated with the `async_scope` attribute, and two intrinsics are inserted to express synchronization.

```python
B = alloc([2])

# Prologue
async_scope:
    B[0] = A[0] + 1
    async_commit_stage(0)

# Body
for i in range(15):
    async_scope:
        B[(i + 1) % 2] = A[i + 1] + 1
        async_commit_stage(0)

    async_wait_stage(0, 1)
    C[i] = B[i % 2] + 1

# Epilogue
async_wait_stage(0, 0)
C[15] = B[1] + 1
```

**Semantics of the proposed intrinsics**. “stage” refers to the same notion as in the TIR software pipeline.
- `async_commit_stage(i)`: Group one or more invocations of async operations, and “commit” them to the `i`-th stage. The exact interpretation of “committing” is up to each backend, but informally it signifies that a group of async operations is now in flight. The group of operations committed together is awaited as one chunk, and thus it constitutes the granularity at which the synchronization intrinsic discussed next operates.

Review Comment:
Thanks, I think this is a great idea. We probably need a separate lowering pass for `commit_stage` to insert a target-specific commit at the right place, while currently the lowering is trivial (a line-by-line change). But it is certainly doable. I'll try to implement this while we resolve the other discussion points.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
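To make the commit/wait semantics concrete, here is a small, hypothetical pure-Python model of the two intrinsics (this is illustrative only, not TVM code or part of the RFC): committed groups of async operations are queued per stage, and waiting with a count `n` completes the oldest groups until at most `n` remain in flight. Replaying the transformed pipeline above against this model reproduces the result of the original loop.

```python
from collections import deque

class AsyncStageModel:
    """Hypothetical model of one async stage: commit groups ops, wait drains them."""

    def __init__(self):
        self.pending = []         # async ops launched but not yet committed
        self.in_flight = deque()  # committed groups, oldest first

    def launch(self, op):
        self.pending.append(op)

    def commit_stage(self):
        # Models async_commit_stage(i): commit all pending ops as one group.
        self.in_flight.append(self.pending)
        self.pending = []

    def wait_stage(self, allowed_in_flight):
        # Models async_wait_stage(i, n): complete oldest groups until at
        # most `allowed_in_flight` groups of this stage remain in flight.
        while len(self.in_flight) > allowed_in_flight:
            for op in self.in_flight.popleft():
                op()  # "completing" an async op applies its effect here

# Replay the transformed pipeline with one async producer stage (stage 0).
A = list(range(16))
B = [0, 0]       # double buffer introduced by the transform
C = [0] * 16
stage0 = AsyncStageModel()

# Prologue
stage0.launch(lambda: B.__setitem__(0, A[0] + 1))
stage0.commit_stage()

# Body
for i in range(15):
    stage0.launch(lambda i=i: B.__setitem__((i + 1) % 2, A[i + 1] + 1))
    stage0.commit_stage()
    stage0.wait_stage(1)  # the producer may run at most 1 iteration ahead
    C[i] = B[i % 2] + 1

# Epilogue
stage0.wait_stage(0)      # drain everything still in flight
C[15] = B[1] + 1

assert C == [a + 2 for a in A]  # same result as the original loop
```

In this model, `wait_stage(1)` in the body is what allows the async copy for iteration `i + 1` to stay outstanding while the consumer reads the buffer for iteration `i`, which is exactly the overlap the stage annotation is meant to buy.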
