memory scope planning

GitBox Mon, 08 Nov 2021 09:31:10 -0800


areusch commented on a change in pull request #38:
URL: https://github.com/apache/tvm-rfcs/pull/38#discussion_r744917237




##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result

Review comment:
       so is this constraining only the output of a particular subgraph (e.g. 
the subgraph can be actually implemented on a different device so long as a 
memory copy is done?)

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`

Review comment:
       just curious, because i've seen compiler_begin and compiler_end before 
but not many examples in complex programs: are these essentially a source-level 
annotation e.g. marking all Relay expressions between the two annotations as 
offloaded to a particular compiler? why shouldn't these be hierarchical e.g. 
CompilerBlock which contains the subgraph as a tree?

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.
+4. Device planning currently has no machinery to hoist adjacent expressions 
which share the same device
+   into their own Relay `Function`. For all our executors except VM that's 
unnecessary anyway since
+   all Relay expressions left over after lowering are interpreted by the 
runtime. However for AOT we
+   have to compile *all* Relay code for a particular target. Note the BOYC 
machinery does support this,
+   but for the purposes of redirecting the compilation flow entirely. We need 
a middle ground.
+5. The BYOC flow is not connected to the `Target` machinery in any way.
+6. The BYOC annotate/partition flow is very similar to the device 
annotate/rewrite flow. For comparison:
+
+   | Feature               | Device Planning            | BYOC                 
                           |
+   | --------------------- | -------------------------- | 
----------------------------------------------- |
+   | Source of annotations | `on_device`, `device_copy` | 
`FTVMAnnotateTarget`, `MergeComposite`+patterns |
+   | Target of planning    | DLDeviceType               | Toolchain name       
                           |
+   | Propagation           | Unification based          | Ad-hoc               
                           |
+   | Relay support         | Full                       | First-order, no ADTs 
                           |
+   | Delimiting            | insert `device_copy`       | insert 
`compiler_begin`, `compiler_end`         |
+   | Multiple per expr     | No                         | Yes (though always 
picks first)                 |
+   | Hoists into functions | No                         | Yes                  
                           |
+   | Customized heuristics | No                         | No                   
                           |
+
+   Taking the 'upper bound' of the two implementations seems ideal, especially 
to address issues 4 (limitation
+   of device planning) and 5 (limitation of BYOC) above.
+
+Our proposal is:
+1. We introduce a new FFI-friendly class to represent a *S*torage or 
*E*xecution *Scope*:
+
+   ```
+   class SEScope {
+     DLDeviceType device_type;
+     int virtual_device_id;
+     Target target;
+     String memory_scope;
+   }
+   ```
+
+   We allow each of these fields to be independently 'constrained' (ie have a 
specific value) or
+   'unconstrained' (no specific value for the field is known yet). In 
particular, it is valid for
+   an `SEScope` to contain only a `device_type`. However if the `target` field 
is defined then
+   `device_type` must equal `target->kind->device_type`.
+
+2. At this stage we leave the `memory_scope` field uninterpreted. For example, 
we don't attempt to
+   represent that, eg, `"global"` on a `kDLCPU` is the same memory area as 
`"host"` on a `kDLCUDA` and thus no
+   `device_copy` operation is required between those scopes. We'll pick this 
issue up again after
+   [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   has landed.
+
+3. The `on_device` and `device_copy` call attributes use `SEScope`s instead of 
integers. However the Python
+   bindings for these 'operators' continue to accept a `Device` for 
convenience. The machinery in `LowerTEPass`
+   which resolves `DLDeviceTypes` to `Targets` is moved up in the compilation 
flow and becomes part of
+   `PlanDevices`. In particular, any `SEScope` encountered during device 
planning is 'canonicalized' to fill
+   in a `Target` by the same lookup as we do today. This means we continue to 
support the easy shorthand of
+   referring to devices by the `DLDeviceType` alone. However, advanced users 
can supply a `SEScope` to these
+   operators which contains the exact `Target` to use.
+
+4. We rework device planning to be in terms of `SEScope`s instead of 
`DLDeviceTypes`. Two `SEScope`s
+   become special:
+    - We need a default scope for all primitive operators which are not 
otherwise
+      constrained to a particular scope.
+    - We need a scope for 'host-only' operations and data, such as for shapes 
and shape functions.
+      (Currently this is hardcoded to `kDLCPU`).
+
+5. We extend `PlanDevices` to be able to a) run *after* lowering and b) refine 
existing constraints.  It will
+   look inside calls to `PrimFunc`s and follow the chain:
+
+   ```
+   tir::PrimFunc.buffer_map -> tir::Buffer.data -> tir::Var.type_annotation -> 
PointerType.storage_scope -> String
+   ```
+
+   to discover the memory scope for each Relay argument. That scope will enter 
`SEScope`s and flow through the
+   existing unification machinery. The existing sub-pass in `PlanDevices` will 
insert `device_copy` calls
+   wherever sub-expressions disagree on their memory scope.
+
+   (An additional pass is planned to heuristically move `device_copy`s around, 
and eliminate redundant
+    copies, however that's outside the scope of this RFC.)
+
+6. We rework `PartitionGraph` to `PartitionBySEScope` to work on `SEScope` 
annotations instead of
+   `compiler_begin` and `compiler_end` annotations. Algorithmically it's not a 
big change -- maximal
+   sub-expressions which share the same `SEScope` (or a projection thereof, eg 
just the `target`) are hoisted
+   into global `Function`s. The function's `"result_se_scope"` attribute 
describes both the scope holding the
+   function's result *and* the `Target` for which the function is to be 
compiled.
+
+7. We allow `MergeComposite` to be used to insert `on_device` annotations, 
call it `MergeAndAnnotate`.
+
+8. (?) We rework `AnnotateTarget` to just look for `FTVMAnnotateTarget` 
operator attributes, call it
+   `AnnotateSEScopes`. When the function fires an `on_device` annotation is 
inserted. However since

Review comment:
       clarifying my understanding:
   ```suggestion
      `AnnotateSEScopes`. When `FTVMAnnotateSEScopes` returns true, an 
`on_device` annotation is inserted. However since
   ```

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.
+4. Device planning currently has no machinery to hoist adjacent expressions 
which share the same device
+   into their own Relay `Function`. For all our executors except VM that's 
unnecessary anyway since
+   all Relay expressions left over after lowering are interpreted by the 
runtime. However for AOT we
+   have to compile *all* Relay code for a particular target. Note the BOYC 
machinery does support this,
+   but for the purposes of redirecting the compilation flow entirely. We need 
a middle ground.
+5. The BYOC flow is not connected to the `Target` machinery in any way.
+6. The BYOC annotate/partition flow is very similar to the device 
annotate/rewrite flow. For comparison:
+
+   | Feature               | Device Planning            | BYOC                 
                           |
+   | --------------------- | -------------------------- | 
----------------------------------------------- |
+   | Source of annotations | `on_device`, `device_copy` | 
`FTVMAnnotateTarget`, `MergeComposite`+patterns |
+   | Target of planning    | DLDeviceType               | Toolchain name       
                           |
+   | Propagation           | Unification based          | Ad-hoc               
                           |
+   | Relay support         | Full                       | First-order, no ADTs 
                           |
+   | Delimiting            | insert `device_copy`       | insert 
`compiler_begin`, `compiler_end`         |
+   | Multiple per expr     | No                         | Yes (though always 
picks first)                 |
+   | Hoists into functions | No                         | Yes                  
                           |
+   | Customized heuristics | No                         | No                   
                           |
+
+   Taking the 'upper bound' of the two implementations seems ideal, especially 
to address issues 4 (limitation
+   of device planning) and 5 (limitation of BYOC) above.
+
+Our proposal is:
+1. We introduce a new FFI-friendly class to represent a *S*torage or 
*E*xecution *Scope*:
+
+   ```
+   class SEScope {
+     DLDeviceType device_type;
+     int virtual_device_id;
+     Target target;
+     String memory_scope;
+   }
+   ```
+
+   We allow each of these fields to be independently 'constrained' (ie have a 
specific value) or
+   'unconstrained' (no specific value for the field is known yet). In 
particular, it is valid for
+   an `SEScope` to contain only a `device_type`. However if the `target` field 
is defined then
+   `device_type` must equal `target->kind->device_type`.
+
+2. At this stage we leave the `memory_scope` field uninterpreted. For example, 
we don't attempt to
+   represent that, eg, `"global"` on a `kDLCPU` is the same memory area as 
`"host"` on a `kDLCUDA` and thus no
+   `device_copy` operation is required between those scopes. We'll pick this 
issue up again after
+   [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   has landed.
+
+3. The `on_device` and `device_copy` call attributes use `SEScope`s instead of 
integers. However the Python
+   bindings for these 'operators' continue to accept a `Device` for 
convenience. The machinery in `LowerTEPass`
+   which resolves `DLDeviceTypes` to `Targets` is moved up in the compilation 
flow and becomes part of
+   `PlanDevices`. In particular, any `SEScope` encountered during device 
planning is 'canonicalized' to fill
+   in a `Target` by the same lookup as we do today. This means we continue to 
support the easy shorthand of
+   referring to devices by the `DLDeviceType` alone. However, advanced users 
can supply a `SEScope` to these
+   operators which contains the exact `Target` to use.

Review comment:
       what would be roughly the deprecation plan here? eventually we ban all 
the inputs to the compiler which could refer to SEScope in terms of 
DLDeviceType and then tighten the typing requirements here? this would be a 
backwards-incompatible Relay change. cc @jroesch

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.
+4. Device planning currently has no machinery to hoist adjacent expressions 
which share the same device
+   into their own Relay `Function`. For all our executors except VM that's 
unnecessary anyway since
+   all Relay expressions left over after lowering are interpreted by the 
runtime. However for AOT we
+   have to compile *all* Relay code for a particular target. Note the BOYC 
machinery does support this,
+   but for the purposes of redirecting the compilation flow entirely. We need 
a middle ground.

Review comment:
       ```suggestion
      have to compile *all* Relay code for a particular target. Note the BYOC 
machinery does support this,
      but for the purposes of more accurately modeling offloaded compute in the 
main compilation flow, we need a middle ground.
   ```

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build

Review comment:
       would be good to clarify as they are also required at runtime to the 
executor ctor
   ```suggestion
   3. The user must also supply a list of `Target` objects to 
`tvm.relay.build`. The compiler uses that list to build
   ```

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated

Review comment:
       nit: sequentially is a bit misleading--maybe suggest
   ```suggestion
   TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
evaluated (in topological order)
   ```

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data

Review comment:
       "default" also is called "fallback," right?

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.

Review comment:
       i agree, but just clarifying we aren't using a single identifier to 
describe both the device and the memory scope?

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.
+4. Device planning currently has no machinery to hoist adjacent expressions 
which share the same device
+   into their own Relay `Function`. For all our executors except VM that's 
unnecessary anyway since
+   all Relay expressions left over after lowering are interpreted by the 
runtime. However for AOT we
+   have to compile *all* Relay code for a particular target. Note the BOYC 
machinery does support this,
+   but for the purposes of redirecting the compilation flow entirely. We need 
a middle ground.
+5. The BYOC flow is not connected to the `Target` machinery in any way.
+6. The BYOC annotate/partition flow is very similar to the device 
annotate/rewrite flow. For comparison:
+
+   | Feature               | Device Planning            | BYOC                 
                           |
+   | --------------------- | -------------------------- | 
----------------------------------------------- |
+   | Source of annotations | `on_device`, `device_copy` | 
`FTVMAnnotateTarget`, `MergeComposite`+patterns |
+   | Target of planning    | DLDeviceType               | Toolchain name       
                           |
+   | Propagation           | Unification based          | Ad-hoc               
                           |
+   | Relay support         | Full                       | First-order, no ADTs 
                           |
+   | Delimiting            | insert `device_copy`       | insert 
`compiler_begin`, `compiler_end`         |
+   | Multiple per expr     | No                         | Yes (though always 
picks first)                 |
+   | Hoists into functions | No                         | Yes                  
                           |
+   | Customized heuristics | No                         | No                   
                           |
+
+   Taking the 'upper bound' of the two implementations seems ideal, especially 
to address issues 4 (limitation
+   of device planning) and 5 (limitation of BYOC) above.
+
+Our proposal is:
+1. We introduce a new FFI-friendly class to represent a *S*torage or 
*E*xecution *Scope*:
+
+   ```
+   class SEScope {
+     DLDeviceType device_type;
+     int virtual_device_id;
+     Target target;
+     String memory_scope;
+   }
+   ```
+
+   We allow each of these fields to be independently 'constrained' (ie have a 
specific value) or
+   'unconstrained' (no specific value for the field is known yet). In 
particular, it is valid for
+   an `SEScope` to contain only a `device_type`. However if the `target` field 
is defined then
+   `device_type` must equal `target->kind->device_type`.
+
+2. At this stage we leave the `memory_scope` field uninterpreted. For example, 
we don't attempt to
+   represent that, eg, `"global"` on a `kDLCPU` is the same memory area as 
`"host"` on a `kDLCUDA` and thus no
+   `device_copy` operation is required between those scopes. We'll pick this 
issue up again after
+   [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   has landed.
+
+3. The `on_device` and `device_copy` call attributes use `SEScope`s instead of 
integers. However the Python
+   bindings for these 'operators' continue to accept a `Device` for 
convenience. The machinery in `LowerTEPass`
+   which resolves `DLDeviceTypes` to `Targets` is moved up in the compilation 
flow and becomes part of
+   `PlanDevices`. In particular, any `SEScope` encountered during device 
planning is 'canonicalized' to fill
+   in a `Target` by the same lookup as we do today. This means we continue to 
support the easy shorthand of
+   referring to devices by the `DLDeviceType` alone. However, advanced users 
can supply a `SEScope` to these
+   operators which contains the exact `Target` to use.
+
+4. We rework device planning to be in terms of `SEScope`s instead of 
`DLDeviceTypes`. Two `SEScope`s
+   become special:
+    - We need a default scope for all primitive operators which are not 
otherwise
+      constrained to a particular scope.
+    - We need a scope for 'host-only' operations and data, such as for shapes 
and shape functions.
+      (Currently this is hardcoded to `kDLCPU`).
+
+5. We extend `PlanDevices` to be able to a) run *after* lowering and b) refine 
existing constraints.  It will
+   look inside calls to `PrimFunc`s and follow the chain:
+
+   ```
+   tir::PrimFunc.buffer_map -> tir::Buffer.data -> tir::Var.type_annotation -> 
PointerType.storage_scope -> String
+   ```
+
+   to discover the memory scope for each Relay argument. That scope will enter 
`SEScope`s and flow through the

Review comment:
       what do you mean by "enter `SEScope`s"?

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.
+4. Device planning currently has no machinery to hoist adjacent expressions 
which share the same device
+   into their own Relay `Function`. For all our executors except VM that's 
unnecessary anyway since
+   all Relay expressions left over after lowering are interpreted by the 
runtime. However for AOT we
+   have to compile *all* Relay code for a particular target. Note the BOYC 
machinery does support this,
+   but for the purposes of redirecting the compilation flow entirely. We need 
a middle ground.
+5. The BYOC flow is not connected to the `Target` machinery in any way.
+6. The BYOC annotate/partition flow is very similar to the device 
annotate/rewrite flow. For comparison:
+
+   | Feature               | Device Planning            | BYOC                 
                           |
+   | --------------------- | -------------------------- | 
----------------------------------------------- |
+   | Source of annotations | `on_device`, `device_copy` | 
`FTVMAnnotateTarget`, `MergeComposite`+patterns |
+   | Target of planning    | DLDeviceType               | Toolchain name       
                           |
+   | Propagation           | Unification based          | Ad-hoc               
                           |
+   | Relay support         | Full                       | First-order, no ADTs 
                           |
+   | Delimiting            | insert `device_copy`       | insert 
`compiler_begin`, `compiler_end`         |
+   | Multiple per expr     | No                         | Yes (though always 
picks first)                 |
+   | Hoists into functions | No                         | Yes                  
                           |
+   | Customized heuristics | No                         | No                   
                           |
+
+   Taking the 'upper bound' of the two implementations seems ideal, especially 
to address issues 4 (limitation
+   of device planning) and 5 (limitation of BYOC) above.
+
+Our proposal is:
+1. We introduce a new FFI-friendly class to represent a *S*torage or 
*E*xecution *Scope*:
+
+   ```
+   class SEScope {
+     DLDeviceType device_type;
+     int virtual_device_id;
+     Target target;
+     String memory_scope;
+   }
+   ```
+
+   We allow each of these fields to be independently 'constrained' (ie have a 
specific value) or
+   'unconstrained' (no specific value for the field is known yet). In 
particular, it is valid for
+   an `SEScope` to contain only a `device_type`. However if the `target` field 
is defined then
+   `device_type` must equal `target->kind->device_type`.
+
+2. At this stage we leave the `memory_scope` field uninterpreted. For example, 
we don't attempt to
+   represent that, eg, `"global"` on a `kDLCPU` is the same memory area as 
`"host"` on a `kDLCUDA` and thus no
+   `device_copy` operation is required between those scopes. We'll pick this 
issue up again after
+   [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   has landed.
+
+3. The `on_device` and `device_copy` call attributes use `SEScope`s instead of 
integers. However the Python
+   bindings for these 'operators' continue to accept a `Device` for 
convenience. The machinery in `LowerTEPass`
+   which resolves `DLDeviceTypes` to `Targets` is moved up in the compilation 
flow and becomes part of
+   `PlanDevices`. In particular, any `SEScope` encountered during device 
planning is 'canonicalized' to fill
+   in a `Target` by the same lookup as we do today. This means we continue to 
support the easy shorthand of
+   referring to devices by the `DLDeviceType` alone. However, advanced users 
can supply a `SEScope` to these
+   operators which contains the exact `Target` to use.
+
+4. We rework device planning to be in terms of `SEScope`s instead of 
`DLDeviceTypes`. Two `SEScope`s
+   become special:
+    - We need a default scope for all primitive operators which are not 
otherwise
+      constrained to a particular scope.
+    - We need a scope for 'host-only' operations and data, such as for shapes 
and shape functions.
+      (Currently this is hardcoded to `kDLCPU`).
+
+5. We extend `PlanDevices` to be able to a) run *after* lowering and b) refine 
existing constraints.  It will
+   look inside calls to `PrimFunc`s and follow the chain:
+
+   ```
+   tir::PrimFunc.buffer_map -> tir::Buffer.data -> tir::Var.type_annotation -> 
PointerType.storage_scope -> String
+   ```
+
+   to discover the memory scope for each Relay argument. That scope will enter 
`SEScope`s and flow through the
+   existing unification machinery. The existing sub-pass in `PlanDevices` will 
insert `device_copy` calls
+   wherever sub-expressions disagree on their memory scope.
+
+   (An additional pass is planned to heuristically move `device_copy`s around, 
and eliminate redundant
+    copies, however that's outside the scope of this RFC.)
+
+6. We rework `PartitionGraph` to `PartitionBySEScope` to work on `SEScope` 
annotations instead of
+   `compiler_begin` and `compiler_end` annotations. Algorithmically it's not a 
big change -- maximal
+   sub-expressions which share the same `SEScope` (or a projection thereof, eg 
just the `target`) are hoisted
+   into global `Function`s. The function's `"result_se_scope"` attribute 
describes both the scope holding the

Review comment:
       so then here, this sort of implements the "grouping adjacent expressions 
onto the same device" as a side-effect?

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`
+   annotations into new top-level `Function`s with a `"Compiler"` attribute 
bound to the toolchain
+   `<name>`.
+5. The rest of the compilation flow treats `"Compiler"` annotated functions 
specially.
+
+We have 6 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 
'Big.LITTLE') and multiple
+   tensor-friendly devices (eg a GPU as well as an accelerator such as Arm 
'Ethos-U'). This means a
+   `DLDeviceType` no longer uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a 
pair of a
+   `DLDeviceType` and an arbitrary 'device id', TVM does not consistently 
plumb the device id
+   through annotations, passes and operators.  Thus currently we cannot use 
'device id' to
+   distinguish, eg, two CPUs in the same system.
+3. Upcoming work requires us to distinguish and propagate memory scopes for 
data at the Relay
+   level. (See also [RFC 
#9](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0009_Unified_Static_Memory_Planning.md)
+   which has a similar need for memory scope propagation at the TIR level). 
This is an identical
+   problem to propagating devices, and it seems most natural to simply combine 
targets, devices and
+   memory scopes into a single 'target of device planing' rather than 
implementing a whole new pass.
+4. Device planning currently has no machinery to hoist adjacent expressions 
which share the same device
+   into their own Relay `Function`. For all our executors except VM that's 
unnecessary anyway since
+   all Relay expressions left over after lowering are interpreted by the 
runtime. However for AOT we
+   have to compile *all* Relay code for a particular target. Note the BOYC 
machinery does support this,
+   but for the purposes of redirecting the compilation flow entirely. We need 
a middle ground.
+5. The BYOC flow is not connected to the `Target` machinery in any way.
+6. The BYOC annotate/partition flow is very similar to the device 
annotate/rewrite flow. For comparison:
+
+   | Feature               | Device Planning            | BYOC                 
                           |
+   | --------------------- | -------------------------- | 
----------------------------------------------- |
+   | Source of annotations | `on_device`, `device_copy` | 
`FTVMAnnotateTarget`, `MergeComposite`+patterns |
+   | Target of planning    | DLDeviceType               | Toolchain name       
                           |
+   | Propagation           | Unification based          | Ad-hoc               
                           |
+   | Relay support         | Full                       | First-order, no ADTs 
                           |
+   | Delimiting            | insert `device_copy`       | insert 
`compiler_begin`, `compiler_end`         |
+   | Multiple per expr     | No                         | Yes (though always 
picks first)                 |
+   | Hoists into functions | No                         | Yes                  
                           |
+   | Customized heuristics | No                         | No                   
                           |
+
+   Taking the 'upper bound' of the two implementations seems ideal, especially 
to address issues 4 (limitation
+   of device planning) and 5 (limitation of BYOC) above.
+
+Our proposal is:
+1. We introduce a new FFI-friendly class to represent a *S*torage or 
*E*xecution *Scope*:
+
+   ```
+   class SEScope {
+     DLDeviceType device_type;
+     int virtual_device_id;

Review comment:
       i think this should be a String name which makes sense to the user. 
Doing this is helpful for a couple other reasons besides the compilation UI:
   - In generated source code, it's possible to refer to the device by name. In 
particular, the embedded C API would like to have this for the conglomerate 
tvm_device_t struct.
   - In systems with multiple e.g. CPUs, using an index here then implies some 
ordering (e.g. littlest CPU to biggest). It's better to make the assignment of 
ID to CPU capability more explicit
   
   Finally, using a name would simplify the heterogeneous Target.
   
   However, this is a bit of a lift. I do feel strongly we should get to this 
world. If it's not something that makes sense to do now, we could also revisit 
after or concurrent with USMP.

##########
File path: rfcs/0038-unified-device-target-and-memory-scope-planning.md
##########
@@ -0,0 +1,244 @@
+- Feature Name: unified-target-device-and-memory-scope-planning
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0038](https://github.com/apache/tvm-rfcs/pull/0038)
+- GitHub Issue: [apache/tvm#9327](https://github.com/apache/tvm/issues/9327)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'hetrogeneous' execution, whereby primitive operators may be 
(sequentially) evaluated
+on more than one device (GPU, CPU, accelerator, etc). For the non-BYOC flow 
this works as follows:
+1. Relay programs may contain `on_device` annotations which specify that a 
sub-expression's result
+   should reside on a device with a given `DLDeviceType` (`kDLCPU`, `kDLCUDA`, 
etc).
+2. The `PlanDevices` pass uses those annotations to decide the unique device 
for every Relay
+   sub-expression, including every primitive operator call. Sub-expressions 
which are unconstrained
+   are assigned to the 'default' device. The pass then inserts `device_copy` 
operators whenever data
+   needs to cross device boundaries.
+3. The user must also supply a list of `Target` objects. The compiler uses 
that list to build
+   a `TargetMap` from `DLDeviceType` to `Target`.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals 
we need to compile
+   ('lower') that primitive for that device. The `Target` to use for that 
compilation is found from
+   the `TargetMap` by the `LowerTEPass`.
+
+For the BYOC flow things are quite different:
+1. Operators may be annotated with an `FTVMAnnotateTarget` function for a 
particular
+   `target.<name>`. Here `<name>` serves only to distinguish possible BYOC 
toolchain names and is
+   currently not connected to the `Target` machinery in any way. The function 
should return true if
+   the given expression could be compiled for toolchain `<name>`. (However 
there are currently no
+   examples of this annotation in-tree.)
+2. The `MergeComposite` pass can be used to assign a `"Composite"` attribute 
to Relay functions
+   which have been hoisted out of a larger expression based on a fusion 
pattern. The attribute can
+   have any value of the form `"some.arbitrary.prefix.<name>"`. Again, this 
indicates the function
+   could be compiled for toolchain `<name>`. (The EthosU compilation flow 
illustrates this approach
+   in-tree.)
+3. The `AnnotateTarget` pass looks for the annotations from (1) and (2) to 
decide the unique
+   toolchain name for every Relay sub-expression which should go via a BYOC 
path. The transitions in
+   to and out of those sub-expressions are marked with `compiler_begin` and 
`compiler_end`
+   annotations.
+4. The `PartitionGraph` pass hoists sub-expressions delimited by 
`compiler_begin` and `compiler_end`

Review comment:
       or is this the part where we translate the former thing to the 
hierarchical representation, and this is just how the implementation happens to 
be now? maybe @jroesch can comment here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [tvm-rfcs] areusch commented on a change in pull request #38: [RFC] Unified device/target/memory scope planning

Reply via email to