trevor-m opened a new pull request #6395:
URL: https://github.com/apache/incubator-tvm/pull/6395
This PR adds support for partitioning, compiling, and running the TensorRT
BYOC target.
# Building
There are two new cmake flags:
* `USE_TENSORRT=ON/OFF`: enables TensorRT code generation - this does not
require TensorRT libraries
* `USE_TENSORRT_GRAPH_RUNTIME=ON/OFF/"path/to/TensorRT"`: enables the TensorRT
runtime - this requires the TensorRT libraries. A system-wide installation of
TensorRT from a deb package or JetPack can be detected with "ON", but a .tar.gz
installation requires you to provide the path to the extracted TensorRT archive.
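For example, the flags can be set in `config.cmake` before building (the archive path below is a placeholder):

```cmake
# Enable the TensorRT codegen (no TensorRT libraries needed).
set(USE_TENSORRT ON)
# Enable the TensorRT runtime; use ON for a system-wide install,
# or point this at an extracted .tar.gz installation.
set(USE_TENSORRT_GRAPH_RUNTIME /path/to/TensorRT)
```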
# Usage
The compilation target should be "cuda" to ensure that input and output args
to the TensorRT functions are placed on the GPU.
```
# Compilation
import tvm
from tvm import relay
from tvm.contrib import graph_runtime
from tvm.relay.op.contrib import tensorrt

mod = tensorrt.partition_for_tensorrt(mod, params)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target="cuda", params=params)

# Running inference is unchanged
mod = graph_runtime.create(graph, lib, ctx=tvm.gpu(0))
mod.run(...)
```
# High level components
## Partitioning
The annotation rules for TensorRT change depending on the version of
TensorRT that is being targeted as well as the "batching mode". This can be
configured with the `trt_version` and `use_implicit_batch` args of
`partition_for_tensorrt`.
If TVM was built against the TensorRT library, the linked version is used
for partitioning instead.
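As an illustration, a version-dependent annotation rule can be sketched as a plain predicate over an op's attributes and the targeted version tuple. This is a hypothetical helper with a made-up threshold, not the PR's actual checker:

```python
def conv2d_is_supported(attrs, trt_version=(6, 0, 1)):
    """Illustrative annotation rule: reject conv2d configurations that the
    targeted TensorRT version (major, minor, patch) cannot run."""
    # Hypothetical rule: suppose non-NCHW layouts only work from TensorRT 6 on.
    if attrs["data_layout"] != "NCHW" and trt_version < (6, 0, 1):
        return False
    return True
```

Comparing version tuples this way lets the same rule table serve TensorRT 5, 6, and 7 without duplicating the annotator.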
## Codegen
This implementation uses the JSONRuntime `JSONSerializer` base class for
codegen, serializing the Relay expression to a JSON format.
## Runtime
During runtime, the `TensorRTBuilder` class (`tensorrt_builder.cc`) converts
the JSON graph to a TensorRT `INetworkDefinition` using the TensorRT APIs,
relying on the op converter classes in `tensorrt_ops.cc` to do this. Then the
TensorRT engine is built; this process can take up to a few minutes because
TensorRT performs its optimizations at this point. The engine is cached and
reused for subsequent inference calls.
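The build-once-then-cache behavior can be sketched in isolation (a toy class; the real runtime builds an `ICudaEngine`, not a string):

```python
class EngineCache:
    """Toy model of the runtime's engine caching: the expensive build runs
    once per graph, and later inference calls reuse the cached result."""

    def __init__(self, build_fn):
        self._build_fn = build_fn  # stands in for the minutes-long TRT build
        self._engines = {}

    def get_engine(self, graph_key):
        if graph_key not in self._engines:
            self._engines[graph_key] = self._build_fn(graph_key)
        return self._engines[graph_key]
```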
The runtime can be compiled against many TensorRT versions thanks to the
preprocessor `#if` guards I have added; it works with TensorRT 5, 6, and 7.
# Areas I'm looking for feedback and ideas
1. TensorRT has parameters such as `max_workspace_size` and
`use_implicit_batch` which I want the user to be able to supply in
`partition_for_tensorrt`. These parameters need to be passed along to the
codegen and stored in the serialized graph until runtime. `use_implicit_batch`
also influences the partitioning rules. Currently, I'm using environment
variables to pass these from python to the codegen in C++. I wonder if there is
a better way to do this?
2. I've implemented a transformation called `prune_tensorrt_subgraphs()` in
`python/tvm/relay/op/contrib/tensorrt.py`. It runs after partitioning and
decides whether to keep a subgraph or return it to the typical TVM compilation
path. This is needed because some subgraphs could be invalid - such as when the
inputs have different batch sizes - or not worth offloading, such as when the
subgraph has no multiply-accumulates. I have also implemented a general
version of this in C++, but it uses the global registry to allow each codegen
target to define its own `is_invalid_subgraph` callback. In the future we can
switch to the generic version if we find a better way to register the callbacks.
3. The targeted TensorRT version needs to be accessed during annotation.
I've put it in a global variable for now.
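The keep/reject decision described in item 2 can be sketched as a pure function (a hypothetical helper, not the PR's actual `prune_tensorrt_subgraphs()` logic):

```python
def keep_subgraph(input_batch_sizes, num_macs):
    """Decide whether a partitioned subgraph is worth sending to TensorRT."""
    # Invalid: inputs disagree on batch size.
    if len(set(input_batch_sizes)) > 1:
        return False
    # Not profitable: no multiply-accumulate operations to accelerate.
    if num_macs == 0:
        return False
    return True
```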
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]