NicolaLancellotti commented on a change in pull request #37:
URL: https://github.com/apache/tvm-rfcs/pull/37#discussion_r719273376



##########
File path: rfcs/0037-arm-ethosu-cascading-planner.md
##########
@@ -0,0 +1,151 @@
+- Feature Name: Arm® Ethos™-U Cascading Planner
+- Start Date: 2021-09-22
+- RFC PR: [apache/tvm-rfcs#0037](https://github.com/apache/tvm-rfcs/pull/37)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+This feature builds upon the support introduced into TVM to compile for Arm® 
Ethos(TM)-U NPUs (see [RFC](https://github.com/apache/tvm-rfcs/pull/11)). In 
that work, we represent NPU operators using Tensor Expressions, allowing them 
to be scheduled using TVM’s scheduling language. The Planner aims to take 
advantage of this by deriving scheduling strategies for graphs off-loaded to 
the NPU that are optimal in both performance and the memory required to run 
the network. The Planner primarily searches for inter-operator scheduling 
opportunities rather than intra-operator ones, which are more common in both 
AutoTVM and the Auto-Scheduler. In particular, it seeks to leverage the 
technique of ‘cascading’.
+
+# Motivation
+[motivation]: #motivation
+
+On deeply embedded devices, working memory in the form of SRAM is at a 
premium. It is typically measured in kilobytes, which severely limits the 
selection of networks that can be run on such devices. For Arm® Ethos™-U NPUs, 
this becomes a significant issue as the NPU provides sufficient compute power 
to run larger models but the system may not have the memory necessary to 
execute them. The Cascading Planner seeks to alleviate this by rescheduling 
networks such that they use less working memory and can accordingly take 
advantage of the considerable compute capabilities of the NPU.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+The Cascading Planner will act as a scheduler for the Arm(R) Ethos(TM)-U NPU 
compiler in TVM. This means it will make decisions about the order in which 
operations are executed, as well as which memories tensors are assigned to and 
whether they need copying between memories. Its primary objective is to reduce 
the working memory needed to run the network until it fits into a 
user-specified memory budget. The behaviour of the Planner will be controlled 
by a number of user-configurable options which will ultimately be exposed all 
the way up to TVMC (exact interface TBD). However, it is expected that for most 
ordinary use-cases, the Planner will be invisible to the user as it will be 
configured with sensible defaults.
+
+Additionally, the Planner will rely on a description of the device's memory 
layout to know the bandwidth and size of the different sections. This allows 
the Planner to effectively optimize models such that they fit within the memory 
constraints while still giving good performance. It is expected that this 
information will be derived from the 'PoolInfo' mechanism that is proposed to 
be used by the Unified Static Memory Planner 
([RFC](https://github.com/apache/tvm-rfcs/pull/9)).
+
+Schedulers for the NPU are required to expose an interface of TE Schedule -> 
TE Schedule. Note that a side-channelled constant dictionary is required due to 
the inability to express constants in TE/TIR. Should progress be made in 
resolving that restriction, this side-channel would be dropped from the design. 
Refer to the following diagram to see where the Planner fits into the 
compilation flow for Ethos-U:
+
+![planner-flow](../resources/planner-flow.png)
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Cascading
+
+Cascading is the term we use for a form of inter-operator scheduling by which 
a series of operations are executed together as a sequence of dependent 
N-dimensional tiles. We refer to these N-dimensional tiles as 'stripes'; each 
corresponds to calculating only part of a tensor at a time. By employing this 
technique, we can reduce the amount of memory required to run the network. The 
following diagram demonstrates a simple example of this for a small network 
containing only two NHWC convolutions. It considers both the 'standard' 
sequential way of scheduling, where operators are executed one after the 
other, and a cascaded schedule using an output stripe size of (1, 3, 3, 4). 
Also note that the diagram omits the weight and bias tensors.
+
+![cascading-diagram](../resources/cascading-diagram.png)
+
+Depending on the choice of stripe sizes, cascading groups and the network 
architecture, the memory saved can be substantial. For example, MobileNetV1 
run naively requires in excess of 1.2MB of memory, but intelligent 
application of the cascading technique can reduce this to around 300kB, making 
it much more accessible to deeply embedded devices. In addition, where a 
hierarchical memory is present with faster and slower regions, reducing the 
memory requirement can improve performance by allowing the intermediate 
results to fit entirely within the fast memories or caches.
+
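The kind of saving described above can be illustrated with some back-of-the-envelope arithmetic. This is only a sketch with hypothetical tensor sizes (not the sizes from the diagram): it compares the intermediate buffer that must be live between two back-to-back convolutions when scheduled sequentially versus cascaded with striping along the height axis.

```python
import math

def elements(shape):
    """Number of elements in an NHWC tensor of the given shape."""
    return math.prod(shape)

# Hypothetical NHWC intermediate tensor between two convolutions.
full_intermediate = (1, 24, 24, 8)

# Cascaded: only one stripe of the intermediate is live at a time; the
# consumer's 3x3 kernel needs kernel_size - 1 = 2 extra rows of overlap.
stripe = (1, 3 + 2, 24, 8)

print(elements(full_intermediate))  # 4608 elements live when sequential
print(elements(stripe))             # 960 elements live when cascaded
```

In practice a rolling-buffer implementation adds some bookkeeping, but the order-of-magnitude reduction is the effect the Planner exploits.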
+## Affine Transforms
+
+Deciding on exactly which operators should be cascaded and with what striping 
parameters is a non-trivial problem. Most local solutions are not globally 
optimal and the search space is very large. We therefore introduce a technique 
using affine transforms to allow for the quick calculation of the memory and 
performance of a given cascading option without having to perform a proper 
scheduling using TE/TIR.
+
+They key piece of information to calculate in order to characterize a cascade 
is how the stripe size changes throughout. This is a function of the data 
dependency between an operator's inputs and outputs. For many operators that 
we're interested in, an affine transform matrix can be used to represent this 
dependency if we represent the input and output stripe sizes as vectors. Affine 
transforms typically consider 'augmented' matrices and vectors 
(https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix) which 
allow for the representation of constant changes. Concretely, we define the 
transform matrix M as being the matrix for which the following holds:

Review comment:
       ```suggestion
   The key piece of information to calculate in order to characterize a cascade 
is how the stripe size changes throughout. This is a function of the data 
dependency between an operator's inputs and outputs. For many operators that 
we're interested in, an affine transform matrix can be used to represent this 
dependency if we represent the input and output stripe sizes as vectors. Affine 
transforms typically consider 'augmented' matrices and vectors 
(https://en.wikipedia.org/wiki/Affine_transformation#Augmented_matrix) which 
allow for the representation of constant changes. Concretely, we define the 
transform matrix M as being the matrix for which the following holds:
   ```
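The augmented-matrix idea in the quoted paragraph can be sketched concretely. The following is an illustrative example only (not the Planner's real API): a transform matrix M maps an augmented output stripe size vector to the input stripe size it depends on, here for a hypothetical operator whose 3x3 kernel needs 2 extra elements in each of two axes.

```python
import numpy as np

def input_stripe_size(M, output_stripe):
    # Augment the output stripe size vector with a constant 1 so the
    # matrix can express fixed offsets (the 'augmented matrix' form).
    augmented = np.append(np.asarray(output_stripe), 1)
    transformed = M @ augmented
    return tuple(int(x) for x in transformed[:-1])  # drop augmentation row

# Hypothetical 2-axis (height, width) operator with a 3x3 kernel:
M = np.array([
    [1, 0, 2],  # input_h = output_h + 2
    [0, 1, 2],  # input_w = output_w + 2
    [0, 0, 1],  # augmentation row
])

print(input_stripe_size(M, (3, 3)))  # (5, 5)
```

Because this is a single matrix-vector product, the memory and performance of a candidate cascade can be estimated cheaply, without constructing a full TE/TIR schedule.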

##########
File path: rfcs/0037-arm-ethosu-cascading-planner.md
##########
@@ -0,0 +1,151 @@
+
+![cascading-formula-1](../resources/cascading-formula-1.png)
+
+Let's briefly consider how to derive such a transform matrix for a 3x3 
unstrided, undilated and unpadded NHWC convolution. Immediately, the 3x3 
kernel tells us something important: a single element in the output depends on 
a 3x3 region in the height/width of the input. If we were instead to consider 
a 2x2 region of the output in the height/width dimensions, we'd need a 4x4 
region of the input. So in general, the rule is that we need 2 more elements 
in height and width when calculating the dependencies of an output stripe; 
more generally, this number is kernel_size - 1 in each axis. Now consider the 
channels: in a convolution, no matter how many output elements you are 
computing, you always need every input channel. This is because the input 
channel axis is a reduction axis in a convolution; in a sense, it isn't 
'reflected' in the output. Combining these two observations, we arrive at the 
following transform matrix:
+
+![cascading-formula-2](../resources/cascading-formula-2.png)
+
+The first row is simply an identity for the batch axis. The second and third 
rows are more interesting, corresponding the the data dependencies in the 
height and width axes. Here we see that the matrix is taking the output size 
and adding a fixed value of 2, as we described above. The fourth row is the 
channels axis, and as we expect this is constant and always 8 - the number of 
channels of the input tensor. The final row is simply the augmentation row and 
can be ignored.

Review comment:
       ```suggestion
   The first row is simply an identity for the batch axis. The second and third 
rows are more interesting, corresponding to the data dependencies in the height 
and width axes. Here we see that the matrix is taking the output size and 
adding a fixed value of 2, as we described above. The fourth row is the 
channels axis, and as we expect this is constant and always 8 - the number of 
channels of the input tensor. The final row is simply the augmentation row and 
can be ignored.
   ```
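The matrix described in the quoted paragraph can be checked numerically. The matrix entries below come directly from that description (identity batch row, +2 on height and width, a constant 8 for the input channels, and the augmentation row); the surrounding code is an illustrative sketch, not the Planner's implementation. Applying it to the output stripe size (1, 3, 3, 4) used earlier in the RFC:

```python
import numpy as np

# Transform matrix for the 3x3 unstrided, undilated, unpadded NHWC
# convolution with 8 input channels, as described in the text.
M = np.array([
    [1, 0, 0, 0, 0],  # batch: identity
    [0, 1, 0, 0, 2],  # height: output + (kernel_size - 1)
    [0, 0, 1, 0, 2],  # width:  output + (kernel_size - 1)
    [0, 0, 0, 0, 8],  # channels: always all 8 input channels
    [0, 0, 0, 0, 1],  # augmentation row
])

output_stripe = np.array([1, 3, 3, 4, 1])  # (1, 3, 3, 4), augmented with 1
input_stripe = M @ output_stripe
print(tuple(int(x) for x in input_stripe[:-1]))  # (1, 5, 5, 8)
```

An output stripe of (1, 3, 3, 4) therefore requires a (1, 5, 5, 8) input stripe, matching the height/width overlap and full-channel reduction reasoning above.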




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

