trevor-m opened a new pull request #6395:
URL: https://github.com/apache/incubator-tvm/pull/6395


   This PR adds support for partitioning, compiling, and running the TensorRT 
BYOC target.
   
   # Building
   
   There are two new cmake flags:
   * `USE_TENSORRT=ON/OFF`: enables TensorRT code generation - this does not 
require TensorRT libraries
   * `USE_TENSORRT_GRAPH_RUNTIME=ON/OFF/"path/to/TensorRT"`: enables the TensorRT 
runtime - this requires the TensorRT libraries. A system-wide install of TensorRT 
from a deb package or JetPack can be detected by `ON`, but a .tar.gz 
installation requires you to provide the path to the extracted TensorRT archive.
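
A build configuration might look like the following sketch (the TensorRT path is illustrative; substitute your own extracted archive location, or pass `ON` for a system-wide install):

```shell
# Configure TVM with TensorRT codegen and runtime enabled.
# The path below is an example location of an extracted .tar.gz install.
cd build
cmake .. \
  -DUSE_TENSORRT=ON \
  -DUSE_TENSORRT_GRAPH_RUNTIME=/opt/tensorrt-7.0.0.11
make -j"$(nproc)"
```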
   
   # Usage
   
   The compilation target should be "cuda" to ensure that input and output args 
to the TensorRT functions are placed on the GPU.
   
   ```python
   # Compilation
   import tvm
   from tvm import relay
   from tvm.contrib import graph_runtime
   from tvm.relay.op.contrib import tensorrt

   mod = tensorrt.partition_for_tensorrt(mod, params)
   with relay.build_config(opt_level=3):
       graph, lib, params = relay.build(mod, target="cuda", params=params)

   # Running inference is unchanged
   mod = graph_runtime.create(graph, lib, ctx=tvm.gpu(0))
   mod.run(...)
   ```
   
   # High level components
   
   ## Partitioning
   
   The annotation rules for TensorRT change depending on the version of 
TensorRT that is being targeted as well as the "batching mode". This can be 
configured with the `trt_version` and `use_implicit_batch` args of 
`partition_for_tensorrt`.
   
   If TVM was built against the TensorRT library, the linked version is used 
for partitioning instead.
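
The shape of this version gating can be pictured with a small predicate. The op names and version cutoffs below are purely illustrative, not the actual annotation tables in this PR:

```python
# Hypothetical sketch of version-gated annotation: an op is only marked
# as supported if the targeted TensorRT version is new enough.
# The cutoffs here are made up for illustration.
MIN_TRT_VERSION = {
    "nn.conv2d": (5, 0, 0),
    "nn.softmax": (5, 0, 0),
    "nn.conv3d": (6, 0, 1),  # e.g. 3D ops require a newer TensorRT
}

def is_supported(op_name, trt_version):
    """Return True if op_name can be offloaded to this TensorRT version."""
    required = MIN_TRT_VERSION.get(op_name)
    # Tuples compare lexicographically, so (6, 0, 1) >= (6, 0, 1) holds.
    return required is not None and trt_version >= required

print(is_supported("nn.conv3d", (5, 1, 5)))  # False
print(is_supported("nn.conv3d", (6, 0, 1)))  # True
```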
   
   ## Codegen
   
   This implementation uses the JSONRuntime `JSONSerializer` base class for 
codegen, serializing the Relay expression to a JSON format.
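
At a high level, the serialized artifact is just a JSON description of the subgraph's nodes and edges. The field names in this round-trip sketch are invented for illustration and do not match the actual `JSONSerializer` output:

```python
import json

# Illustrative sketch of serializing a tiny two-node graph to JSON and
# reading it back, mimicking the serialize-then-deserialize flow.
# The schema here is hypothetical.
graph = {
    "nodes": [
        {"op": "input", "name": "data"},
        {"op": "nn.relu", "name": "relu0", "inputs": [0]},
    ],
}
serialized = json.dumps(graph)    # what codegen would embed in the module
restored = json.loads(serialized) # what the runtime would parse back
print(restored["nodes"][1]["op"])  # nn.relu
```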
   
   ## Runtime
   
   During runtime, the `TensorRTBuilder` class (`tensorrt_builder.cc`) is used 
to convert the JSON graph to a TensorRT `INetworkDefinition` using TensorRT 
APIs. It uses the op converter classes in `tensorrt_ops.cc` to do this. Then 
the TensorRT engine is built; this process can take up to a few minutes because 
TensorRT performs its optimizations at this point. The engine is cached for 
subsequent inference calls.
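
The build-once, reuse-forever pattern described above can be sketched in a few lines (a generic memoization sketch, not the PR's C++ implementation):

```python
# Hypothetical sketch of engine caching: building an "engine" is the
# expensive optimization step, so results are keyed and reused.
class EngineCache:
    def __init__(self, build_fn):
        self.build_fn = build_fn  # expensive build step (can take minutes)
        self.cache = {}
        self.builds = 0           # counts how many real builds happened

    def get(self, key):
        if key not in self.cache:
            self.builds += 1
            self.cache[key] = self.build_fn(key)
        return self.cache[key]

cache = EngineCache(lambda k: f"engine-for-{k}")
cache.get("subgraph0")
cache.get("subgraph0")  # second call is served from the cache
print(cache.builds)  # 1
```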
   
   The runtime can be compiled against many TensorRT versions thanks to the 
preprocessor version guards I have added. It will work for TensorRT 5, 6, and 7.
   
   # Areas I'm looking for feedback and ideas
   
   1. TensorRT has parameters such as `max_workspace_size` and 
`use_implicit_batch` which I want the user to be able to supply in 
`partition_for_tensorrt`. These parameters need to be passed along to the 
codegen and stored in the serialized graph until runtime. `use_implicit_batch` 
also influences the partitioning rules. Currently, I'm using environment 
variables to pass these from python to the codegen in C++. I wonder if there is 
a better way to do this?
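
The environment-variable handoff looks roughly like this sketch, with both the setter and the reader shown in Python for brevity (in the PR the read happens on the C++ codegen side via `getenv`; the variable name and encoding here are illustrative, not the actual ones):

```python
import json
import os

# Hypothetical sketch of passing codegen options from Python to C++
# through the process environment. The variable name is made up.
def set_trt_options(max_workspace_size, use_implicit_batch):
    os.environ["TVM_TENSORRT_OPTIONS"] = json.dumps({
        "max_workspace_size": max_workspace_size,
        "use_implicit_batch": use_implicit_batch,
    })

def get_trt_options():
    # Stands in for the C++ side reading the same variable via getenv().
    return json.loads(os.environ["TVM_TENSORRT_OPTIONS"])

set_trt_options(1 << 30, True)
print(get_trt_options()["use_implicit_batch"])  # True
```

A cleaner alternative hinted at by the question might be threading these options through a pass-level config object rather than process state, since environment variables are global and easy to leave stale.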
   
   2. I've implemented a transformation called `prune_tensorrt_subgraphs()` in 
`python/tvm/relay/op/contrib/tensorrt.py`. This is run after partitioning and 
allows me to decide whether to keep a subgraph or return it back to the typical 
TVM compilation path. This is needed because some subgraphs could be invalid - 
such as when the inputs have different batch sizes or for optimization purposes 
if the subgraph has no multiply-accumulates. I have also implemented a general 
version of this in C++, but it uses the global registry to allow each codegen 
target to define its own `is_invalid_subgraph` callback. In the future we can 
switch to the generic version if we find a better way to register the callbacks.
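
The pruning decision can be pictured with a small predicate over a subgraph summary. The two rejection criteria come straight from the paragraph above; the function and its inputs are hypothetical simplifications, not the actual `prune_tensorrt_subgraphs()` logic:

```python
# Hypothetical sketch of the subgraph-pruning decision: reject subgraphs
# whose inputs disagree on batch size (invalid under implicit batch mode),
# or which contain no multiply-accumulates (not worth offloading).
def keep_subgraph(input_batch_sizes, num_macs):
    if len(set(input_batch_sizes)) > 1:  # inconsistent batch sizes
        return False
    if num_macs == 0:                    # no MACs: TVM codegen is fine
        return False
    return True

print(keep_subgraph([8, 8], num_macs=100))  # True
print(keep_subgraph([8, 4], num_macs=100))  # False
print(keep_subgraph([8, 8], num_macs=0))    # False
```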
   
   3. The targeted TensorRT version needs to be accessed during annotation. 
I've put it in a global variable for now.

