This is an automated email from the ASF dual-hosted git repository.
comaniac pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-rfcs.git
The following commit(s) were added to refs/heads/main by this push:
new 2d57c28 [RFC] Pipeline Executor (#14)
2d57c28 is described below
commit 2d57c28d55ab26587c748724c9eef4e0835d5ea8
Author: Hua Jiang <[email protected]>
AuthorDate: Fri Aug 20 09:41:49 2021 -0700
[RFC] Pipeline Executor (#14)
* add pipeline compute rfc.
* Update rfcs/0012-pipeline-executor.md
Co-authored-by: Cody Yu <[email protected]>
* address review comments.
* Update rfcs/0012-pipeline-executor.md
Co-authored-by: Cody Yu <[email protected]>
* address review comments.
* Update rfcs/0012-pipeline-executor.md
Co-authored-by: Cody Yu <[email protected]>
* rename rfcs file name into 0014.
Co-authored-by: hua jiang <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
---
resources/pipeline-executor-arch.png | Bin 0 -> 72676 bytes
resources/pipeline-executor-pipeline.png | Bin 0 -> 237086 bytes
resources/pipeline-executor-runtime.png | Bin 0 -> 90514 bytes
resources/pipeline-executor-schedule.png | Bin 0 -> 97559 bytes
resources/pipeline-executor-subgraph-split.png | Bin 0 -> 156056 bytes
resources/pipeline-executor.png | Bin 0 -> 39948 bytes
rfcs/0014-pipeline-executor.md | 236 +++++++++++++++++++++++++
7 files changed, 236 insertions(+)
diff --git a/resources/pipeline-executor-arch.png b/resources/pipeline-executor-arch.png
new file mode 100644
index 0000000..3f91dd3
Binary files /dev/null and b/resources/pipeline-executor-arch.png differ
diff --git a/resources/pipeline-executor-pipeline.png b/resources/pipeline-executor-pipeline.png
new file mode 100644
index 0000000..a634c3a
Binary files /dev/null and b/resources/pipeline-executor-pipeline.png differ
diff --git a/resources/pipeline-executor-runtime.png b/resources/pipeline-executor-runtime.png
new file mode 100644
index 0000000..a9857d2
Binary files /dev/null and b/resources/pipeline-executor-runtime.png differ
diff --git a/resources/pipeline-executor-schedule.png b/resources/pipeline-executor-schedule.png
new file mode 100644
index 0000000..e3dcc83
Binary files /dev/null and b/resources/pipeline-executor-schedule.png differ
diff --git a/resources/pipeline-executor-subgraph-split.png b/resources/pipeline-executor-subgraph-split.png
new file mode 100644
index 0000000..d9e2937
Binary files /dev/null and b/resources/pipeline-executor-subgraph-split.png differ
diff --git a/resources/pipeline-executor.png b/resources/pipeline-executor.png
new file mode 100644
index 0000000..a7858ee
Binary files /dev/null and b/resources/pipeline-executor.png differ
diff --git a/rfcs/0014-pipeline-executor.md b/rfcs/0014-pipeline-executor.md
new file mode 100644
index 0000000..7a173d2
--- /dev/null
+++ b/rfcs/0014-pipeline-executor.md
@@ -0,0 +1,236 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements. See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership. The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License. You may obtain a copy of the License at -->
+
+<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied. See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+- Feature Name: Pipeline Executor
+- Start Date: 2021-07-30
+- RFC PR: [apache/tvm-rfcs#0014](https://github.com/apache/tvm-rfcs/pull/0014)
+- GitHub Issue: [apache/tvm#8596](https://github.com/apache/tvm/issues/8596)
+
+## 1. Summary
+
+
+This proposal introduces Pipeline Executor: a runtime executor that schedules
+a list of Relay modules in a pipeline to achieve task-level parallelism and
+improve computation throughput.
+
+## 2. Motivation
+
+
+
+Currently, more and more edge inference deployments happen on SoC devices.
+SoC devices contain heterogeneous chipsets such as GPU, FPGA, CPU, and DSP,
+so reaching the best performance requires running an ML network across these
+heterogeneous chipsets. However, the current graph executor has no parallelism
+logic, and the existing data-parallelism solution only supports parallelism on
+a homogeneous chipset (device). The only way to do batch processing on
+heterogeneous devices with TVM today is to treat a whole ML network as one
+schedule unit and run copies of it on different devices, but that causes a
+latency issue: the slowest chipset becomes the latency bottleneck for
+processing a single input.
+
+Therefore, we need a runtime executor that provides parallel scheduling with a
+finer-grained schedule unit such as a subgraph (a group of operators with
+dependency relations), so that SoC heterogeneous hardware resources can be
+used more efficiently to achieve better performance.
+
+
+### Benefits of Pipeline Executor
+
+Pipeline Executor provides three benefits:
+
+* Computing a single network on multiple backends in parallel to improve
+  performance.
+
+* Using RPC to perform distributed computation across multiple remote devices.
+
+* The capability to integrate non-DNN model functions.
+
+## 3. Guide-level explanation
+Pipeline Executor is a runtime executor that implements pipeline execution
+logic for multiple subgraphs and relies on graph_executor for operator
+storage and execution.
+
+This section introduces the use case for Pipeline Executor:
+
+1. Manually split/partition a Relay module into a list of Relay modules and
+   generate the module configuration (automatic module splitting is out of
+   the scope of this RFC and will be future work).
+2. Use pipeline_executor to build a pipeline module with the subgraphs and
+   configuration.
+3. Use pipeline_executor to load the pipeline module to run the network in
+   pipeline parallelism mode.
+
+### 3.1. Manually split a Relay module into a list of Relay modules and generate the module configuration.
+
+```python
+
+mod1, mod2, mod3 = my_manual_partitioner(mod)
+pipe_cfg = PipelineModuleConfig()
+
+# Define pipeline inputs. Here I assume two inputs of mod1 and one input of
+# mod3 are the pipeline inputs.
+pipe_cfg.inputs["data_0"] = (mod1, "data_0")
+pipe_cfg.inputs["data_1"] = (mod1, "data_1")
+pipe_cfg.inputs["data_2"] = (mod3, "data_0")
+
+# Define pipeline outputs to be the first output of mod3.
+pipe_cfg.outputs.append((mod3, 0))
+
+# Define connections.
+pipe_cfg.connect(mod1, 0, mod2, "data_0") # mod1.output(0) -> mod2.data_0
+pipe_cfg.connect(mod2, 0, mod3, "data_1") # mod2.output(0) -> mod3.data_1
+
+# Print config for debugging
+print(str(pipe_cfg))
+# Inputs:
+# |- data_0: mod1.data_0
+# |- data_1: mod1.data_1
+# |- data_2: mod3.data_0
+# Outputs:
+# |- mod3.output(0)
+# Connections:
+# |- mod1.output(0) -> mod2.data_0
+# |- mod2.output(0) -> mod3.data_1
+
+
+```
+
+### 3.2. Use pipeline_executor to build a pipeline module with the subgraphs and configuration.
+
+The interface is mostly the same as the graph executor's, but it accepts a
+pipeline configuration instead of a Relay module. Here is an example.
+
+```python
+
+# Use the config to build a pipeline executor
+with relay.build_config(opt_level=3):
+    lib = pipeline_executor.build_pipeline(pipe_cfg)
+
+```
+
+### 3.3. Use pipeline_executor to load the pipeline module to run the network in pipeline parallelism mode.
+
+Pipeline Executor works asynchronously. Unlike the blocking `run` API in the
+graph executor, the `run` API in Pipeline Executor is non-blocking. As a
+result, we could have the following scenario:
+
+1. set_input(): Push the input to the queue.
+2. run(): Launch a task with the first input in the queue.
+3. set_input(): Push the second input to the queue.
+4. set_input(): Push the third input to the queue.
+5. run(): Launch a task with the second input.
+6. get_output(): Get the output of the first input.
+7. run(): Launch a task with the third input.
+8. get_output(): Get the output of the second input.
+9. get_output(): Get the output of the third input.
+
+As can be seen, `get_output()` can be called at any time to get the first
+available output in the result queue; it returns an empty array if no output
+is ready.
+
+The following is one example:
+
+```python
+#...
+
+datas = []
+for _ in range(5):
+    # Each data includes 3 tensors (i.e., data_0, data_1, data_2 for
+    # the pipeline).
+    datas.append([np.full(shape[i], 0).astype("float32") for i in range(3)])
+
+# Feed all available inputs.
+for data in datas:
+ pipeline_module.set_input("data_0", data[0])
+ pipeline_module.set_input("data_1", data[1])
+ pipeline_module.set_input("data_2", data[2])
+ pipeline_module.run()
+
+# Get all outputs.
+pipeline_outputs = []
+while pipeline_module.has_next_output():
+    pipeline_outputs.append(pipeline_module.get_output())
+
+```
+
+## 4. Reference-level explanation
+
+This section introduces the underlying techniques of Pipeline Executor.
+The figures below briefly illustrate the workflow of the system.
+
+The Pipeline Executor architecture:
+
+![Pipeline Executor architecture](../resources/pipeline-executor-arch.png)
+
+Manually constructing the subgraphs:
+
+![Subgraph split](../resources/pipeline-executor-subgraph-split.png)
+
+How the Pipeline Executor runtime works:
+
+![Pipeline Executor runtime](../resources/pipeline-executor-runtime.png)
+
+The Pipeline Executor scheduling logic:
+
+![Pipeline Executor schedule](../resources/pipeline-executor-schedule.png)
+
+The network pipeline compute effect:
+
+![Pipeline compute effect](../resources/pipeline-executor-pipeline.png)
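+
+To make the scheduling idea concrete, below is a minimal, self-contained
+sketch of pipeline parallelism, assuming each subgraph runs in its own worker
+thread on its own device with bounded queues connecting the stages. This is
+an illustration only, not the actual Pipeline Executor implementation; the
+stage functions are hypothetical stand-ins for running a subgraph via
+graph_executor.
+
+```python
+# Conceptual sketch only: three pipeline stages connected by queues.
+import threading
+import queue
+
+def make_stage(run_subgraph, in_q, out_q):
+    """Worker loop for one stage; `run_subgraph` stands in for executing
+    one subgraph via graph_executor on its own device."""
+    def worker():
+        while True:
+            data = in_q.get()
+            if data is None:      # poison pill: shut this stage down
+                out_q.put(None)
+                break
+            out_q.put(run_subgraph(data))
+    return threading.Thread(target=worker, daemon=True)
+
+# Three stages (e.g., mod1 on CPU, mod2 on GPU, mod3 on DSP).
+q0, q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(4))
+stages = [
+    make_stage(lambda x: x + 1, q0, q1),  # placeholder for mod1
+    make_stage(lambda x: x * 2, q1, q2),  # placeholder for mod2
+    make_stage(lambda x: x - 3, q2, q3),  # placeholder for mod3
+]
+for s in stages:
+    s.start()
+
+# set_input()/run() corresponds to pushing into q0, and get_output() to
+# draining q3; successive inputs overlap in time across the three stages.
+for i in range(5):
+    q0.put(i)
+q0.put(None)
+
+outputs = []
+while (item := q3.get()) is not None:
+    outputs.append(item)
+print(outputs)  # [-1, 1, 3, 5, 7]
+```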
+
+## 5. Drawbacks
+
+
+Pipeline Executor currently requires manual subgraph splitting and
+configuration construction. A future graph-splitting feature will automate
+the split.
+
+## 6. Rationale and alternatives
+
+
+Without Pipeline Executor, TVM can still run a network on heterogeneous
+hardware, but the execution is serialized: operators run one after another
+instead of running in parallel on different hardware.
+
+
+
+## 7. Prior art
+
+
+**Schedule primitives such as vectorize:** schedule primitives implement
+data parallelism on a single device.
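+
+As an illustration of this kind of single-device data parallelism, here is a
+small sketch using TVM's tensor expression API (an example for context, not
+part of this RFC's proposal):
+
+```python
+# Vectorize the inner loop of an element-wise computation so its lanes
+# execute as SIMD operations on one device.
+import tvm
+from tvm import te
+
+n = 1024
+A = te.placeholder((n,), name="A")
+B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
+
+s = te.create_schedule(B.op)
+outer, inner = s[B].split(B.op.axis[0], factor=8)
+s[B].vectorize(inner)  # data parallelism within a single device
+
+f = tvm.build(s, [A, B], target="llvm")
+```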
+
+## 8. Unresolved questions
+
+
+Automatically splitting the compute graph.
+
+## 9. Future possibilities
+
+### Using an automatic graph-split feature to construct the pipeline subgraphs and configuration.
+
+This feature is not in the scope of this RFC; the logic is as follows.
+
+The future solution includes three steps: 1. operator auto-tuning, 2. graph
+dependency-tree build and balance, and 3. graph auto-tuning. More details
+follow.
+
+#### 1. Operator auto-tuning
+
+* a. In the operator auto-tuning step, the user uses the existing tuning
+  logic to tune every operator. The tuning happens separately and serially
+  on every target involved in the pipeline.
+
+* b. After operator tuning is done, we get performance data, for example:
+  conv2d_0's best latency is 3 ms on GPU and 2 ms on VTA. This performance
+  data is used later in the graph dependency-tree build and balance step.
+
+#### 2. Graph dependency-tree build and balance
+
+* a. Initialize a DAG whose nodes are subgraphs. Initially, for an N-node
+  DAG, the first N-1 nodes map to layers 1 through N-1 of the original
+  compute graph (compute-intensive operators and others), and the N-th node
+  maps to layers N through M, where M is the number of layers in the
+  original compute graph.
+
+* b. Using the performance data generated in step 1.b, every dependency-tree
+  node gets a time-cost value. Initially these values differ across nodes,
+  so the DAG is not balanced in node weight. By adjusting the scope of a
+  node (how many operators it contains), we make every node of the DAG have
+  the same or a close weight (time cost); such a DAG is then one graph-split
+  solution. We use a DAG to record the parent/child relations (a child can
+  only run after its parent has run), and scope adjustment can only happen
+  between a parent and a child; see the sketch below.
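+
+The sketch below illustrates the balancing idea under two simplifying
+assumptions: the model is a linear chain of layers, and per-layer latencies
+come from the operator auto-tuning step above. It splits the chain into k
+contiguous subgraphs so that the slowest stage (the pipeline bottleneck) is
+as fast as possible. All names and numbers are illustrative.
+
+```python
+def can_split(latency, k, limit):
+    """Greedy check: can the chain form <= k groups, each summing <= limit?"""
+    groups, acc = 1, 0.0
+    for t in latency:
+        if acc + t > limit:
+            groups, acc = groups + 1, t
+        else:
+            acc += t
+    return groups <= k
+
+def balance(latency, k):
+    """Binary-search the smallest feasible per-stage time budget, then
+    recover the layer indices where each new subgraph starts."""
+    lo, hi = max(latency), sum(latency)
+    while hi - lo > 1e-6:
+        mid = (lo + hi) / 2
+        if can_split(latency, k, mid):
+            hi = mid
+        else:
+            lo = mid
+    splits, acc = [], 0.0
+    for i, t in enumerate(latency):
+        if acc + t > hi:
+            splits.append(i)
+            acc = t
+        else:
+            acc += t
+    return splits
+
+# Hypothetical per-layer latencies (ms) from operator auto-tuning.
+layer_ms = [3.0, 2.0, 4.0, 1.0, 2.5, 2.5, 3.0]
+print(balance(layer_ms, k=3))  # [2, 5]: stages of 5.0 ms, 7.5 ms, 5.5 ms
+```
+
+Varying k or perturbing the split points yields the multiple candidate
+solutions that the graph auto-tuning step below measures end to end.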
+
+#### 3. Graph auto-tuning
+
+* a. Step 2 can generate more than one subgraph-split solution DAG. In this
+  step, graph auto-tuning tries these multiple solutions to find the best
+  configuration, as sketched below.
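+
+A hypothetical sketch of this selection, assuming a `measure_throughput`
+callback that builds and benchmarks the pipeline for one split solution:
+
+```python
+def pick_best(candidate_splits, measure_throughput):
+    """Keep the candidate with the best measured end-to-end throughput."""
+    return max(candidate_splits, key=measure_throughput)
+
+# e.g., candidates produced by balance() with different k values:
+# best = pick_best([[2, 5], [2, 4], [3, 5]], measure_throughput)
+```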
+
+After steps 1, 2, and 3, we have an automatic graph-split configuration.