mbs-octoml commented on a change in pull request #38:
URL: https://github.com/apache/tvm-rfcs/pull/38#discussion_r734004042



##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'Big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Targets` coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.

Review comment:
       So I've rejigged to take on 5 and put less emphasis on the target 
wrangling, which I think will work itself out through a combination of 
@Mousius's work and incremental cleanup. Deep breath.
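
For readers following the thread, the `TargetMap` lookup from steps 3-4 of the quoted summary, and the ambiguity raised in problem 1, can be sketched roughly as below. This is a plain-Python illustration, not TVM code: the `Target` tuple and `build_target_map` helper are hypothetical stand-ins, and raising on a duplicate device type is an assumption made here to surface the ambiguity, not a statement of what TVM does. Only the two `DLDeviceType` values are taken from `dlpack`.

```python
from enum import IntEnum
from typing import Dict, List, NamedTuple


class DLDeviceType(IntEnum):
    # Values follow dlpack's enum; only the two used below are listed.
    kDLCPU = 1
    kDLCUDA = 2


class Target(NamedTuple):
    # Hypothetical stand-in for a TVM Target, not the real class.
    device_type: DLDeviceType
    spec: str


def build_target_map(targets: List[Target]) -> Dict[DLDeviceType, Target]:
    """Map each device type to its Target, rejecting ambiguous device types."""
    target_map: Dict[DLDeviceType, Target] = {}
    for target in targets:
        if target.device_type in target_map:
            # Two Targets share one device type, so the device type alone
            # cannot select a Target (problem 1 in the quoted summary).
            raise ValueError(
                f"{target.device_type.name} maps to both "
                f"{target_map[target.device_type].spec!r} and {target.spec!r}"
            )
        target_map[target.device_type] = target
    return target_map


# One CPU and one GPU: every device type resolves to exactly one Target.
target_map = build_target_map(
    [Target(DLDeviceType.kDLCPU, "llvm"), Target(DLDeviceType.kDLCUDA, "cuda")]
)
lowering_target = target_map[DLDeviceType.kDLCUDA]
```

Passing two targets with `kDLCPU` (for example, the two core types of a heterogeneous CPU system) makes the helper raise, which is the case a device-type-keyed map cannot represent.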




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.