mbs-octoml commented on a change in pull request #38:
URL: https://github.com/apache/tvm-rfcs/pull/38#discussion_r725285510



##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
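The flow in steps 3-4 can be sketched with a plain dictionary standing in for `TargetMap` (the device-type codes follow dlpack, but the target strings and function names here are illustrative, not TVM's actual API):

```python
# Hypothetical sketch of the TargetMap lookup described in steps 3-4.
# The DLDeviceType codes match dlpack; the target strings are illustrative.
KDL_CPU = 1
KDL_CUDA = 2

# Step 3: the user supplies a list of Target objects; the compiler keys
# them by their DLDeviceType.
user_targets = [(KDL_CPU, "llvm"), (KDL_CUDA, "cuda")]
target_map = {device_type: target for device_type, target in user_targets}

# Step 4: lowering a primitive operator call planned for a given
# DLDeviceType looks up its Target in the map.
def target_for(device_type):
    return target_map[device_type]

# Problem 1 below in a nutshell: two distinct CPUs both map to kDLCPU,
# so this dict can only record one Target for both of them.
assert target_for(KDL_CPU) == "llvm"
```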
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Target`s coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.
+2. Allow `Target` objects to be registered under a globally unique target label. Registration may be 'static' (ie
+   built into the TVM compiler via another REGISTER macro) or 'dynamic' (ie injected for a particular run of the
+   compiler, eg as part of `tvmc` command line processing). (This machinery should be reconciled with the existing
+   CUDA-specific target registration map.)
+3. Change the "on_device" call attributes to use a string instead of an integer (ie `DLDeviceType`). The string
+   can be of the form `<target label>` or `<target label>:<device id>`. The former simply implies a device id of 0.
+4. Rework device planning to use a pair of `Target` and 'device id' instead of `DLDeviceType`:
+   ```
+   class TargetDevice {
+    public:
+     Target target;
+     int device_id;
+   };
+   ```
+   (We could also use a `Device` and accept the redundant `DLDeviceType` specification.) It is trivial
+   to go from an "on_device" label to a `TargetDevice` and back using the global `Target` registry.
+5. Remove all uses of `TargetMap`. For example, in `LowerTEPass` we simply use the `TargetDevice` associated with
+   every primitive operator call already found by device planning.
+6. Bind two `TargetDevice`s as attributes on every `IRModule`:
+    - The default for primitive operators not otherwise constrained by "on_device" annotations.
+    - The default for non-primitive operators, such as Relay control flow and shape computation.
+7. Remove the various copies of target/target_host reconciliation, `TargetMap`
+   construction and 'default/fallback' device calculation from the codebase.
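Proposal items 2-4 can be sketched together as follows (the registry shape, label, and parsing rule here are a minimal illustration, not TVM's actual API):

```python
from dataclasses import dataclass

# Stand-in for TVM's Target; only the label string matters for this sketch.
@dataclass(frozen=True)
class Target:
    label: str

# Item 2: a global registry from a unique target label to its Target.
TARGET_REGISTRY = {}

def register_target(label, target):
    assert label not in TARGET_REGISTRY, "target labels must be globally unique"
    TARGET_REGISTRY[label] = target

# Item 4: the (Target, device id) pair device planning works with.
@dataclass(frozen=True)
class TargetDevice:
    target: Target
    device_id: int

# Item 3: parse "<target label>" or "<target label>:<device id>";
# a missing device id simply implies 0.
def parse_on_device(annotation):
    label, sep, device_id = annotation.partition(":")
    return TargetDevice(TARGET_REGISTRY[label], int(device_id) if sep else 0)

register_target("big_cpu", Target("llvm -mcpu=cortex-a75"))
assert parse_on_device("big_cpu:1").device_id == 1
assert parse_on_device("big_cpu").device_id == 0
```

Going back from a `TargetDevice` to an "on_device" label is the inverse lookup over the same registry, which is why a single source of truth for `Target`s matters here.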
+
+This proposal tackles the original problems:
+1. There's now no ambiguity about `Target`s since we propagate them from the global registry directly.
+2. We support device ids.

Review comment:
       I'm going to call these `virtual_device_id`s, and leave unspecified how they are mapped to physical device ids, whether later in compilation or at runtime. For example, a `virtual_device_id` could be nothing other than an index into an array of actual device structures at runtime.
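The suggestion in this comment can be sketched in a few lines (the table contents and function name are hypothetical; only the "index into a runtime device table" idea comes from the comment):

```python
# Hypothetical sketch: a virtual_device_id is just an index, and the runtime
# owns the table mapping it to a physical device.
physical_devices = ["cpu:0", "cpu:1", "gpu:0"]  # populated by the runtime

def resolve(virtual_device_id):
    # The virtual-to-physical mapping is deferred entirely to runtime:
    # here it is nothing more than an array lookup.
    return physical_devices[virtual_device_id]

assert resolve(0) == "cpu:0"
```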




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

