# Heterogeneous execution in Relay VM

## Goal

The Relay graph runtime supports executing different parts of a graph on different 
devices, i.e. heterogeneous execution. We'd like to port this feature to the 
Relay VM.

## Non-goals

A limitation of the device annotation pass is that it assumes all the 
computation happens inside a single function, so it cannot compute the device 
assignment across multiple Relay functions. This could become an issue if, for 
example, we allocate a GPU tensor in the main function but call out to a 
tensor-array concatenate operation that lives in another Relay function; it 
might crash or silently copy to CPU memory (I haven't experimented yet). The 
proper fix is to implement an interprocedural analysis for the device 
annotation pass, which is out of scope for this proposal.

## Current Design in Relay Graph Runtime

### Compilation

Reference: https://github.com/dmlc/tvm/pull/2361

Summary: If users want to specify the device an operator should run on, they 
can wrap an expression with an annotation operator named `on_device(expr, dev_id)`. 
During `relay.build`, the `RunDeviceAnnotationPass` step replaces each 
`on_device` node with a `device_copy` node. The `GraphPlanMemory` step then 
computes the device assignment (the `device_type`, see the next section) of 
each memory block. This is possible because the graph runtime only supports 
static graphs, so all the information can be captured statically. During 
native code generation, each `device_copy` node is mapped to a special packed 
function named `__copy`.
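
To make the flow concrete, here is a minimal sketch of the user-facing side, 
following the Python API at the time (exact signatures may differ; the 
per-device target dict and `fallback_device` are the knobs the PRs above 
introduce):

```python
import tvm
from tvm import relay

x = relay.var("x", shape=(10,))
y = relay.var("y", shape=(10,))

# Pin the addition to the GPU; everything else falls back to the CPU.
add = relay.annotation.on_device(relay.add(x, y), tvm.gpu(0))
out = relay.subtract(add, relay.const(1.0))
func = relay.Function([x, y], out)

# One target per device type. RunDeviceAnnotationPass rewrites on_device into
# device_copy nodes, and GraphPlanMemory assigns a device_type to each block.
target = {"cpu": "llvm", "cuda": "cuda"}
with relay.build_config(opt_level=1, fallback_device=tvm.cpu(0)):
    graph, lib, params = relay.build(func, target)
```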

### Runtime

Reference: https://github.com/dmlc/tvm/pull/1695

Summary: In the graph JSON file, a new field named `device_type` specifies 
which device each static memory node should be scheduled to, and the runtime 
allocates the memory on that device accordingly. When the graph runtime sees 
the special operator named `__copy`, it calls `TVMArrayCopyFromTo` to move the 
data across devices.
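For illustration, a sketch of how the heterogeneous graph is then instantiated 
at runtime; it assumes `graph` and `lib` come from a build like the one above, 
and that the context list covers every `device_type` present in the graph JSON:

```python
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# One context per device type referenced in the graph JSON; the runtime picks
# the right context from each node's device_type and executes the special
# `__copy` nodes via TVMArrayCopyFromTo.
ctxs = [tvm.cpu(0), tvm.gpu(0)]
module = graph_runtime.create(graph, lib, ctxs)
module.set_input("x", np.random.rand(10).astype("float32"))
module.set_input("y", np.random.rand(10).astype("float32"))
module.run()
print(module.get_output(0))
```
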

## Proposal for Relay VM

### Compilation

We should be able to reuse the workflow up to and including 
`RunDeviceAnnotationPass`. The VM compiler, which translates Relay expressions 
into VM opcodes, needs to map each `device_copy` node to a new opcode, 
`DeviceCopy(src_register, dst_register)`. The tensor object in each register 
should carry its device context so the VM knows how to copy the data. This 
implies a change to `AllocTensor` as well: we need to attach the device 
context to that instruction so the VM knows where to allocate the memory; 
right now it just uses the default context.
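
A hypothetical sketch of what the two instruction payloads could carry; the 
real definitions would live in the C++ `Instruction` struct, so the field 
names below are purely illustrative:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AllocTensor:
    shape: Tuple[int, ...]
    dtype: str
    dst_register: int
    device_type: int   # new: device to allocate on, e.g. kDLCPU / kDLGPU
    device_id: int     # new: device index

@dataclass
class DeviceCopy:
    src_register: int  # tensor to copy from; its NDArray carries the source context
    dst_register: int  # tensor on the destination device to copy into
```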

### VM Runtime

The VM runtime needs to implement the device-aware `AllocTensor` and the new 
`DeviceCopy` instruction.
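
As a rough sketch of the runtime behavior (written in Python for brevity; the 
actual dispatch loop is C++, and `registers` here stands in for the VM 
register file):

```python
import tvm
from tvm import nd

def exec_alloc_tensor(instr, registers):
    # Allocate on the device attached to the instruction instead of the
    # default context.
    ctx = tvm.context(instr.device_type, instr.device_id)
    registers[instr.dst_register] = nd.empty(instr.shape, instr.dtype, ctx)

def exec_device_copy(instr, registers):
    # Both registers hold NDArrays that carry their own context, so copyto
    # (TVMArrayCopyFromTo under the hood) moves the data across devices.
    src = registers[instr.src_register]
    dst = registers[instr.dst_register]
    src.copyto(dst)
```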

## Tasks

- [ ] Add opcode `DeviceCopy`.
- [ ] Add device context to `AllocTensor`.
- [ ] Change `VMCompiler` to attach the device context to `AllocTensor`.
- [ ] Change `VMCompiler` to emit the `DeviceCopy` opcode.

cc @icemelon9 @zhiics @zxy844288792 @jroesch @tqchen 

