areusch commented on a change in pull request #15:
URL: https://github.com/apache/tvm-rfcs/pull/15#discussion_r686176141



##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+- Feature Name: [RFC] Use CMSIS-NN with TVM
+- Start Date: July 2021
+- RFC PR: https://github.com/apache/tvm-rfcs/pull/15
+- GitHub Issue: https://github.com/apache/tvm/issues/8646
+
+# Acronyms
+CMSIS: Common Microcontroller Software Interface Standard
+ACL: The Compute Library for the Arm® Architecture
+MLF: Model Library Format
+FVP: Arm® Corstone™-300 Fixed Virtual Platform
+
+# Summary
+
+This RFC introduces the plan to integrate the CMSIS-NN library into TVM. The library consists of efficient kernels targeted at Arm's Cortex-M architecture.
+
+Please refer to the following pages for more details on CMSIS-NN.
+* https://arm-software.github.io/CMSIS_5/NN/html/index.html
+* https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN
+
+The first PR in the series of PRs for this integration will be the graph partitioner for int8 softmax. The detailed plan can be found below in this RFC.
+
+
+# Motivation
+
+The CMSIS-NN library consists of hand-tuned kernels that are suitable for Cortex-M and are compliant with the quantization scheme used in TensorFlow Lite. They have been optimized for the better performance and small memory footprint required on these embedded devices, so it makes sense for TVM to reuse them while generating code for Cortex-M. They have already been integrated into the TensorFlow Lite Micro project.
+
+
+# Guide-level explanation
+
+TVM's BYOC infrastructure allows for partitioning and code generation using an external compiler. Partitioned subgraphs containing the operator(s) targeted for Cortex-M can then be translated into CMSIS-NN C API calls, which eventually become part of the MLF.
+
+If a user runs tvmc, they will get an MLF archive which calls out to the CMSIS-NN operators.
+
+```
+tvmc --target=cmsisnn,c --output-format=mlf --executor=aot
+```
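The MLF output is a plain tar archive, so its contents can be inspected with standard tools. A minimal sketch: the layout below is a mock built locally for illustration, and `module.tar` is a hypothetical output name, not the fixed name tvmc uses.

```shell
# Build a mock archive mirroring the typical MLF layout, then list it.
# A real MLF produced by tvmc also contains the generated C sources
# (including the CMSIS-NN calls) under codegen/host/src.
mkdir -p mlf_demo/codegen/host/src
echo '{}' > mlf_demo/metadata.json
tar -cf module.tar -C mlf_demo .
tar -tf module.tar
```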
+
+
+# Reference-level explanation
+
+We will enable this integration starting with TFLite networks, but it is equally applicable to all other networks that can be translated into Relay IR. A TFLite test graph that contains just a quantized (int8) softmax is first converted by the TFLite frontend into the following sequence of Relay operations: *dequantize -> softmax -> quantize*. Please refer to the Relay code snippet below, obtained from the TFLite frontend.
+
+```python
+def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] {
+  %0 = qnn.dequantize(%a, 0.02f /* ty=float32 */, 64 /* ty=int32 */) /* 
ty=Tensor[(1, 16, 16, 3), float32] */;
+  %1 = nn.softmax(%0) /* ty=Tensor[(1, 16, 16, 3), float32] */;
+  qnn.quantize(%1, 0.02f /* ty=float32 */, 64 /* ty=int32 */, 
out_dtype="int8") /* ty=Tensor[(1, 16, 16, 3), int8] */
+}
+```
+
+Here is the API to obtain the partitioned function aimed at CMSIS-NN.
+
+```python
+# API to call CMSIS-NN partitioning
+from tvm.relay.op.contrib import cmsisnn
+
+# Here, `module` is the Relay IRModule to be partitioned
+cmsisnn_module = cmsisnn.partition_for_cmsisnn(module)
+```
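Under the hood, `partition_for_cmsisnn` is expected to rely on TVM's standard BYOC passes (composite pattern merging, target annotation, and graph partitioning). As a rough, library-free illustration of the matching idea only, not TVM's actual pattern API, the `cmsisnn.qnn_softmax` composite amounts to recognizing this exact operator sequence:

```python
# Illustration only: mimics the composite-pattern check the real
# partitioner performs with TVM's pattern machinery. The sequence
# mirrors PartitionedFromPattern="qnn.dequantize_nn.softmax_qnn.quantize_".
QNN_SOFTMAX_OPS = ("qnn.dequantize", "nn.softmax", "qnn.quantize")

def find_qnn_softmax(ops):
    """Return the start indices where the composite pattern occurs in a
    linearized operator sequence."""
    n = len(QNN_SOFTMAX_OPS)
    return [i for i in range(len(ops) - n + 1)
            if tuple(ops[i:i + n]) == QNN_SOFTMAX_OPS]
```

In the real flow, TVM's pattern machinery performs this match on the Relay dataflow graph (not a flat list) and wraps each matched region into a composite function.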
+
+The following code block shows the resultant IRModule.
+
+```python
+def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] {
+  @tvmgen_default_cmsisnn_0(%a) /* ty=Tensor[(1, 16, 16, 3), int8] */
+}
+
+def @tvmgen_default_cmsisnn_0(%cmsisnn_0_i0: Tensor[(1, 16, 16, 3), int8], 
Inline=1, Compiler="cmsisnn", global_symbol="tvmgen_default_cmsisnn_0", 
Primitive=1) -> Tensor[(1, 16, 16, 3), int8] {
+  %2 = fn (%FunctionVar_0_0: Tensor[(1, 16, 16, 3), int8], 
PartitionedFromPattern="qnn.dequantize_nn.softmax_qnn.quantize_", 
Composite="cmsisnn.qnn_softmax") -> Tensor[(1, 16, 16, 3), int8] {
+    %0 = qnn.dequantize(%FunctionVar_0_0, 0.02f /* ty=float32 */, 64 /* 
ty=int32 */) /* ty=Tensor[(1, 16, 16, 3), float32] */;
+    %1 = nn.softmax(%0) /* ty=Tensor[(1, 16, 16, 3), float32] */;
+    qnn.quantize(%1, 0.02f /* ty=float32 */, 64 /* ty=int32 */, 
out_dtype="int8") /* ty=Tensor[(1, 16, 16, 3), int8] */
+  };
+  %2(%cmsisnn_0_i0) /* ty=Tensor[(1, 16, 16, 3), int8] */
+}
+```
+
+The above partitioned function is presented to the CMSIS-NN external code generator for *tir* generation using TVM's build() API.

Review comment:
       nit: capitalize TIR everywhere

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+
+```python
+    # Invoke AOT compiler to get the MLF containing CMSIS-NN APIs
+    with tvm.target.Target("c -runtime=c --link-params -mcpu=cortex-m55 
--executor=aot --unpacked-api=1"):
+        factory = tvm.relay.build(cmsisnn_mod)
+
+```
+
+Intermediate *tir* looks like this:
+
+```python
+primfn(placeholder_1: handle, out_write_1: handle) -> ()
+    attr = {"global_symbol": "main", "tir.noalias": True}
+    buffers = {placeholder: Buffer(placeholder_1: Pointer(int8), int8, [1, 
300, 300, 3], []),
+                out_write: Buffer(out_write_1: Pointer(int8), int8, [1, 300, 
300, 3], [])}
+    buffer_map = {placeholder_1: placeholder_1, out_write_1: out_write_1} {
+    ...
+    allocate(placeholder.d.global, uint8, [1,300,300,3]) {
+        @tir.call_extern("cmsisnn_softmax_s8", ..., dtype=handle)
+    }
+}
+```
+In the future, target hooks for `relay_to_tir`, implemented as part of [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10), will be used to obtain the above *tir* for the graph with softmax. These hooks provide us with the flexibility to reuse memory planning and much of TVM's code generation capabilities.
+
+At last, code generator identifies the *tir* extern_call(s) and generates *c* 
code for softmax with the CMSIS-NN API for softmax int8.

Review comment:
       > At last,
   
   Specify where exactly in TVM you're talking about

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+# Reference-level explanation
+
+We will enable this integration by considering TFLite networks, but is equally 
applicable for all other networks that can be translated into Relay IR. TFLite 
test that contains just a quantized (int8) softmax is first converted as a 
sequence of following relay operations: *dequantize -> softmax -> quantize* by 
the TFLite frontend. Please refer to the relay code snippet below obtained from 
TFLite frontend.

Review comment:
       so kind of I think this entire section belongs in Guide-level 
explanation (after all, currently if using this from Python, rather than 
`tvmc`, the user is expected to invoke these APIs).

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+If a user runs tvmc, they will get a MLF format archive which calls out to the 
CMSIS operators.
+
+```
+tvmc --target=cmsisnn,c --output-format=mlf --executor=aot

Review comment:
       how should the user understand which version of CMSIS_NN to use with 
their projects? 

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+Above partitioned function is presented to the CMSIS-NN external code 
generator for *tir* generation using the TVM's build() API. 
+
+```python
+    # Invoke AOT compiler to get the MLF containing CMSIS-NN APIs
+    with tvm.target.Target("c -runtime=c --link-params -mcpu=cortex-m55 
--executor=aot --unpacked-api=1"):
+        factory = tvm.relay.build(cmsisnn_mod)
+
+```
+
+Intermediate *tir* looks like this:

Review comment:
       I think this code and the following paragraph can remain in 
Reference-level Explanation. Can you add a brief description of the 
pattern_table defined 
[here](https://github.com/apache/tvm/pull/8653/files#diff-0009d3bd0a8b47f8846c09a67a869e23ec6988e914521a56f8cb9912168dd7e5R46)
 (for example, which operators will live there and which won't)? Also, can you 
explain any conditions under which a CMSIS-NN function might not be exposed to 
TVM (e.g. will functions such as `arm_nn_mat_mult_kernel_q7_q15_reordered` also 
be implemented, and is there a design of how to handle this at the relay level)?

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+Intermediate *tir* looks like this:
+
+```python
+primfn(placeholder_1: handle, out_write_1: handle) -> ()
+    attr = {"global_symbol": "main", "tir.noalias": True}
+    buffers = {placeholder: Buffer(placeholder_1: Pointer(int8), int8, [1, 
300, 300, 3], []),
+                out_write: Buffer(out_write_1: Pointer(int8), int8, [1, 300, 
300, 3], [])}
+    buffer_map = {placeholder_1: placeholder_1, out_write_1: out_write_1} {
+    ...
+    allocate(placeholder.d.global, uint8, [1,300,300,3]) {
+        @tir.call_extern("cmsisnn_softmax_s8", ..., dtype=handle)
+    }
+}
+```
+In future, target hooks for `relay_to_tir` implemented as part of [Additional 
Target Hooks] (https://github.com/apache/tvm-rfcs/pull/10) will be used to 
obtain the above tir for graph with softmax. These hooks provide us with the 
flexibility to reuse memory planning and much of the TVM's code generation 
capabilities.

Review comment:
       You'll produce this TIR either way, right? Just that right now, the TIR 
is compiled straight to runtime.Module, while later on, it'll be returned to 
the compile pipeline. can you clarify this in the sentence here?

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,118 @@
+# Guide-level explanation

Review comment:
       I've made some suggestions to reorganize this section a bit to add 
details

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+Note: There are no changes required in config.cmake, as the CMSIS-NN APIs corresponding to the operators are hard-coded. The testing infrastructure links them to the CMSIS-NN library. Execution of the networks works similarly to what has been described in [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11).
+
+Once the entire infrastructure for CMSIS-NN mapping is in place using the softmax API, we will gradually add more complex operations, such as depthwise convolution and pooling, to both the graph partitioning and code generation infrastructure.
+
+
+# Testing
+
+As we introduce the operators, we will keep adding individual unit tests. Once the operator support is partially complete, we will start adding network tests. We are planning to use the [Arm® Corstone™-300 Fixed Virtual Platform](https://developer.arm.com/ip-products/subsystem/corstone/corstone-300) to run these tests in the CI. Reference: [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11/files). A unit test will provide two kinds of checks: one around the correctness of the partitioned function and the other around the validity of the output from Corstone-300 against native TVM output.
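The numerical half of such a check can be sketched as an element-wise comparison of the int8 outputs within a small tolerance. This is an illustration only; the names are hypothetical, and real tests would obtain one output from a Corstone-300 run and the other from native TVM execution.

```python
# Illustrative sketch of the output-validity check: int8 results from the
# two runs are allowed to differ by at most `tolerance` quantization steps.
def outputs_match(expected, actual, tolerance=1):
    """Return True if both int8 output vectors agree element-wise
    within the given tolerance."""
    return len(expected) == len(actual) and all(
        abs(e - a) <= tolerance for e, a in zip(expected, actual)
    )
```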

Review comment:
       Organize this section a bit more:
    - Open by describing that there are two types of tests so the organization 
is clear.
    - The last sentence describes the checks a unit test would perform. Can you 
move it up with the content about unit tests at top?
    - Can you describe what the integration tests are asserting?

##########
File path: rfcs/0015_Arm_CMSIS-NN_Integration.md
##########
@@ -0,0 +1,143 @@
+# Guide-level explanation
+
+TVM's BYOC infrastructure allows for the partitioning and code generation 
using the external compiler. Partitioned subgraphs containing operator(s) 
targeted for Cortex-M can then be translated into the CMSIS-NN C APIs which 
eventually become part of MLF.

Review comment:
       can you expand--does the CMSIS-NN source also get included in the MLF, 
or just the calls to function external to MLF?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to