manupa-arm commented on a change in pull request #11:
URL: https://github.com/apache/tvm-rfcs/pull/11#discussion_r673872280



##########
File path: rfcs/0011_Arm_Ethos-U_Integration.md
##########
@@ -0,0 +1,233 @@
+    Feature Name: Arm® Ethos™-U Integration
+    Start Date: 2020 May
+    RFC PR: https://github.com/apache/tvm-rfcs/pull/11
+    GitHub Issue: https://github.com/apache/tvm/issues/8482
+
+# Motivation
+
+Arm® Ethos™-U is a series of NPUs that enable low-cost and highly efficient AI solutions for a wide range of embedded devices. This RFC introduces the port of Ethos-U into the microTVM compilation flow. The compilation process relies on TVM's multiple levels of IR abstraction and a variety of analysis and optimization passes to produce C-sources that work with current microTVM deployments.
+
+## Scope:
+
+### Ethos™-U55
+
+![](./assets/0011/ethosu_hw.png)
+
+Ethos™-U55 is an NPU designed to uplift ML performance by working as an offload target for microcontrollers. It can accelerate quantized ML operators such as Convolution2D, Depthwise Convolution, Pooling and Elementwise Operators. For convolution-type operators, Ethos-U55 supports hardware-enabled lossless decompression of weights to increase inference performance and reduce power consumption.
+
+The scope of this RFC is to add support for offloading to the Arm Ethos-U55 NPU. The initial machine learning framework that we use for testing this is TensorFlow Lite. Future RFCs and pull requests will address additional NPUs, such as the Ethos-U65, and other frameworks as the port evolves.
+
+Please refer to the Technical Reference Manual (TRM) for more details: https://developer.arm.com/documentation/102420/0200.
+* Reference: https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55
+
+# Guide-level explanation
+
+## TVMC User Interface
+```
+tvmc compile my_model.tflite \
+--executor=aot \
+--output-format=mlf \
+--target="ethos-u --accelerator-config=ethos-u55-xxx, c"   # ---> Model Library Format
+
+# where xxx is one of the possible accelerator configurations: 32, 64, 128 or 256
+```
+
+Users should be able to use the above command to compile to Ethos-U55 and generate Model Library Format (MLF) output.
+Please take a look at our provided example in the last PR (once it is published).
+
+## Design Architecture Overview
+
+![](./assets/0011/ethosu_compiler_arch.png)
+
+We rely on the graph partitioning infrastructure in Relay (commonly known as BYOC) to integrate the Relay and TIR pass pipelines and generate C-source artifacts that can be used in an embedded deployment environment. The generated C-sources are expected to be bundled with the AOT executor in the Model Library Format (MLF) tarball, which embedded users can consume just as they would any other AOT-executor build in a typical embedded environment.
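+
+As a rough sketch of how such an MLF archive could be produced programmatically, assuming a TVM build where `relay.build` accepts an `executor` argument and `tvm.micro.export_model_library_format` is available (the tiny Relay module here is a stand-in for a model imported from a frontend):
+
+```python
+# A sketch only: build for the C target with the AOT executor and export the
+# result as a Model Library Format tarball. The trivial Relay module stands in
+# for mod/params obtained from e.g. relay.frontend.from_tflite(...).
+import tvm
+from tvm import relay
+from tvm.relay.backend import Executor
+from tvm.micro import export_model_library_format
+
+x = relay.var("x", shape=(1, 8), dtype="int8")
+mod = tvm.IRModule.from_expr(relay.Function([x], relay.abs(x)))
+params = {}
+
+with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(
+        mod,
+        target="c",                # generate C sources for the host CPU
+        executor=Executor("aot"),  # bundle with the AOT executor
+        params=params,
+    )
+
+# Package the generated C sources and metadata as an MLF tarball.
+export_model_library_format(lib, "./module.tar")
+```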
+
+### Why are the operators lowered to TIR before the runtime.Module is created?
+
+The two main reasons are as follows:
+
+#### Cascading-style performance and memory optimizations
+
+Given the deterministic nature of the hardware, we intend to utilize TVM's scheduling language to perform inter- and intra-operator optimizations that reduce the memory footprint while maintaining good performance.
+
+Please refer to this discuss post for more information: https://discuss.tvm.apache.org/t/rfc-cascade-scheduling/8119/8
+
+#### Unified static memory planning
+
+Ethos™-U is an NPU that is aimed at running with microTVM. Therefore, as with typical microTVM use cases, the Ethos™-U NPU will require aggressive memory optimizations achieved by sharing buffers with intermediate tensors used by the CPU.
+We envision a flow that exposes the TIR generated by the Ethos™-U codegen to the future unified static memory planner for optimization.
+
+For more information about the proposed unified static memory planner, please refer to this discuss post: https://discuss.tvm.apache.org/t/rfc-unified-static-memory-planning/10099.
+
+# Reference-level explanation
+
+## Compilation flow
+
+### C1. TVM Frontend and Partitioning
+
+The Relay graph produced by TVM's frontend will be partitioned into Ethos-U subgraphs by running the AnnotateTarget, MergeCompilerRegions and PartitionGraph Relay passes. This procedure results in the creation of "external" Relay functions that are redirected to the Ethos-U Relay and TIR pass pipeline for the creation of C-sources as stated above.
+
+```
+# A Partitioned example for Conv2D
+
+def @main(%input: Tensor[(1, 300, 300, 3), int8]) -> Tensor[(1, 298, 298, 32), 
int8] {
+    @ethosu_0(%input) /* ty=Tensor[(1, 298, 298, 32), int8] */
+}
+
+def @ethosu_0(%ethosu_0_i0: Tensor[(1, 300, 300, 3), int8], Compiler="ethosu", 
...) {
+    %2 = fn (%FunctionVar_0_0: Tensor[(1, 300, 300, 3), int8],
+                
PartitionedFromPattern="qnn.conv2d_nn.bias_add_qnn.requantize_",
+                Composite="ethosu.qnn_conv2d") {
+        %0 = qnn.conv2d(%FunctionVar_0_0, meta[relay.Constant][0], -26, ...);
+        %1 = nn.bias_add(%0, meta[relay.Constant][2], axis=3);
+        qnn.requantize(%1, meta[relay.Constant][3], 0, 12341.8f, 0, 
out_dtype="int8")
+    };
+    %2(%ethosu_0_i0)
+}
+```
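+
+As a rough illustration of the pass sequence described above, the sketch below drives the standard Relay transforms; the `pattern_table` argument is a hypothetical stand-in for the Ethos-U composite patterns registered by the integration:
+
+```python
+# A sketch of the partitioning flow: annotate supported operators, merge
+# adjacent regions and outline them as external functions with Compiler="ethosu".
+import tvm
+from tvm import relay
+
+def partition_for_ethosu(mod, pattern_table):
+    seq = tvm.transform.Sequential([
+        relay.transform.InferType(),
+        relay.transform.MergeComposite(pattern_table),  # e.g. qnn.conv2d + bias_add + requantize
+        relay.transform.AnnotateTarget("ethosu"),       # mark supported composites/operators
+        relay.transform.MergeCompilerRegions(),         # group adjacent marked operators
+        relay.transform.PartitionGraph(),               # outline regions as @ethosu_N functions
+    ])
+    return seq(mod)
+```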
+
+
+### C2. Relay Legalization to Ethos™-U HW Primitive Operations
+
+In the design, we have decided to introduce TE expressions that closely describe the compute of each primitive operation that the hardware can natively execute; we define these as Ethos™-U HW primitive operations, each exposed as its own Relay operator. Moreover, there are many Relay operators that could be lowered to the Ethos™-U HW primitives (e.g., dense could be legalized to a conv2d operator). This component legalizes the external Relay function to Ethos™-U HW primitive operations.
+
+Ethos™-U hardware supports per-channel quantization by encoding a scale with each bias value. Thus, the weight scales are converted to that format and packed with the biases. Thereafter, the packed biases and scales become a constant input to the Relay operator.
+For more details, please refer to: https://developer.arm.com/documentation/102420/0200
+
+```
+# The partitioned function above, legalized to the contrib.ethosu.conv2d operator.
+
+fn (%ethosu_0_i0: Tensor[(1, 300, 300, 3), int8], ..., 
global_symbol="ethosu_0", Primitive=1) {
+    contrib.ethosu.conv2d(%ethosu_0_i0, meta[relay.Constant][0], 
meta[relay.Constant][1], -26, ...)
+}
+```
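+
+The exact bias/scale encoding is defined by the hardware (see the TRM linked above) and is produced through Vela's APIs; the sketch below, with a purely hypothetical layout, only illustrates the idea of folding the per-output-channel scales into the bias constant:
+
+```python
+# Conceptual sketch only: pair each int32 bias with its per-channel rescale
+# factor so the legalized operator receives a single packed constant. The real
+# Ethos-U encoding differs and is generated via ethos-u-vela.
+import numpy as np
+
+def pack_biases_and_scales(biases, scales):
+    assert biases.shape == scales.shape, "one scale per output channel"
+    packed = np.empty(biases.size, dtype=[("bias", np.int32), ("scale", np.float32)])
+    packed["bias"] = biases
+    packed["scale"] = scales
+    return packed
+
+# Example: 32 output channels, as in the conv2d shown earlier.
+packed = pack_biases_and_scales(np.zeros(32, dtype=np.int32),
+                                np.full(32, 0.0235, dtype=np.float32))
+```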
+
+
+### C3. Ethos™-U TE/TIR Compiler Passes
+
+At this stage, we should have a TE representation of all HW primitive operations that belong to the offloaded function. We schedule the TE representation into a TIR PrimFunc that describes the intermediary storage and the hardware operations that need to be executed. In the future, we intend to add more TE/TIR passes so that the Ethos™-U TE/TIR compiler can perform memory and performance optimizations (see https://discuss.tvm.apache.org/t/rfc-cascade-scheduling/8119). Therefore, it is vital to have all the operations represented in TE/TIR. It is important to note that Ethos™-U hardware requires weights to be 'encoded' in a certain way to be readable by the hardware. Therefore, the weight encoding is performed here and represented in the TIR PrimFunc as buffers with post-encoding sizes.
+
+```
+primfn(placeholder_1: handle, placeholder_2: handle, placeholder_3: handle, 
ethosu_write_1: handle) -> ()
+    attr = {"global_symbol": "main", "tir.noalias": True}
+    buffers = {buffer: Buffer(buffer_2: Pointer(uint8), uint8, [320], []),
+                placeholder: Buffer(placeholder_4: Pointer(int8), int8, [1, 
300, 300, 3], []),
+                buffer_1: Buffer(buffer_3: Pointer(uint8), uint8, [1312], []),
+                ethosu_write: Buffer(ethosu_write_2: Pointer(int8), int8, [1, 
298, 298, 32], [])}
+    buffer_map = {placeholder_3: buffer, ethosu_write_1: ethosu_write, 
placeholder_2: buffer_1, placeholder_1: placeholder} {
+    attr [placeholder.global: Pointer(uint8)] "storage_scope" = "global";
+    allocate(placeholder.global, uint8, [1312]);
+    attr [placeholder.d.global: Pointer(uint8)] "storage_scope" = "global";
+    allocate(placeholder.d.global, uint8, [320]) {
+        @tir.call_extern("ethosu_copy", (uint8*)buffer_3[0], 1312, 
(uint8*)placeholder.global[0], dtype=handle)
+        @tir.call_extern("ethosu_copy", (uint8*)buffer_2[0], 320, 
(uint8*)placeholder.d.global[0], dtype=handle)
+        @tir.call_extern("ethosu_conv2d", "int8", 300, 300, 3, 300, 0, 300, 
(int8*)placeholder_4[0], ...)
+    }
+}
+```
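+
+As a generic (not Ethos-U-specific) sketch of the lowering step that produces PrimFuncs like the one above, a TE compute can be scheduled and lowered to TIR as follows; the identity compute is a stand-in for the much richer ethosu.conv2d TE definition:
+
+```python
+# A generic sketch of TE -> TIR lowering; the identity compute stands in for
+# the real Ethos-U HW primitive TE definitions.
+import tvm
+from tvm import te
+
+ifm = te.placeholder((1, 300, 300, 3), dtype="int8", name="ifm")
+ofm = te.compute(ifm.shape, lambda n, h, w, c: ifm[n, h, w, c], name="ofm")
+
+sch = te.create_schedule(ofm.op)
+lowered = tvm.lower(sch, [ifm, ofm], name="ethosu_0")  # IRModule containing a TIR PrimFunc
+print(lowered["ethosu_0"])
+```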
+
+
+Given the complexity of this component, we will be putting up a separate RFC to describe the functionality of the Ethos™-U TE/TIR compiler in detail.
+
+### C4. Translating Ethos™-U TIR PrimFuncs to C-sources that call the Ethos™-U driver APIs to perform execution
+
+Ethos™-U hardware is used from the host CPU by invoking a driver API call with a command stream (a uNPU-specific binary artifact) that describes the hardware operators that need to be executed. This component uses the TIR PrimFunc to extract the hardware operators and buffer information. Thereafter, we will use the backend Python APIs of the Arm® Vela compiler (https://pypi.org/project/ethos-u-vela/) to convert the TIR PrimFunc to a command stream. Finally, the generated command stream will be wrapped in a C-source that invokes it using the Ethos™-U driver APIs.
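+
+As a rough sketch of the extraction step, assuming the standard TIR visitor utilities, the extern calls emitted into the PrimFunc ("ethosu_copy", "ethosu_conv2d", ...) could be collected as follows before being handed to Vela's backend APIs; the helper name is illustrative:
+
+```python
+# A sketch only: collect extern call names from a lowered PrimFunc so they can
+# be translated into a command stream. The real translator also extracts
+# buffer and parameter information.
+import tvm
+from tvm.tir import stmt_functor
+
+def extract_extern_calls(prim_func):
+    calls = []
+
+    def visit(node):
+        if (isinstance(node, tvm.tir.Call)
+                and isinstance(node.op, tvm.ir.Op)
+                and node.op.name == "tir.call_extern"):
+            calls.append(node.args[0].value)  # args[0] is the symbol, e.g. "ethosu_conv2d"
+
+    stmt_functor.post_order_visit(prim_func.body, visit)
+    return calls
+```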

Review comment:
       Done.



