Hi Marco,

Sorry about the formatting getting lost.
Here's the original Google doc. I originally wanted to use Confluence, but I didn't have permissions to edit, so here goes: https://docs.google.com/document/d/1UbsUacxWRKXCEE6v0r4VmKL76QLmFQYgMyAcQP0I8U0/edit?usp=sharing

Best,
Marek

On Mon, Jun 11, 2018 at 10:54 AM Marco de Abreu <[email protected]> wrote:

Hello Marek,

this sounds great! Definitely looking forward to it.

It seems like our mailing list destroyed your formatting. You might want to consider putting it into a Google Docs document or uploading it to Confluence.

Best regards,
Marco

On Mon, Jun 11, 2018 at 10:50 AM Marek Kolodziej <[email protected]> wrote:

Hi everyone,

This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator for MxNet (a PR for discussion is coming in the next few days; the ETA for the first draft of the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here: https://developer.nvidia.com/tensorrt

2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort: either rewriting the model using TensorRT's graph-building APIs, or exporting the model to ONNX followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework but no longer does in a stand-alone TensorRT deployment with a client application.

3. TensorRT is very performant, but does not have the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't need to know which operators are supported by TensorRT and which ones aren't: runtime integration lets the graph partitioner extract subgraphs capable of running inside TensorRT, place each such subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle the non-TensorRT-compatible nodes remaining after subgraph extraction and node substitution as regular MxNet operators (see the simplified sketch after this list). The goal is to accelerate inference without changing the user experience.
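As a simplified, self-contained illustration of the partitioning idea in point 3 (this is not the actual NNVM graph pass, which works on a DAG rather than a flat list), the sketch below greedily groups maximal runs of TensorRT-compatible operators into candidate TensorRT subgraphs and leaves everything else to regular MxNet operators. The TRT_SUPPORTED set and the operator names are hypothetical placeholders.

    # Toy partitioner: not the real graph pass, just the idea behind it.
    TRT_SUPPORTED = {"Convolution", "BatchNorm", "Activation", "Pooling", "FullyConnected"}

    def partition(ops):
        """Split a linear op sequence into ('tensorrt' | 'mxnet', [ops]) segments."""
        segments = []
        for op in ops:
            kind = "tensorrt" if op in TRT_SUPPORTED else "mxnet"
            if segments and segments[-1][0] == kind:
                segments[-1][1].append(op)     # extend the current segment
            else:
                segments.append((kind, [op]))  # start a new segment
        return segments

    print(partition(["Convolution", "BatchNorm", "Activation", "MyCustomOp",
                     "Pooling", "FullyConnected", "SoftmaxOutput"]))
    # [('tensorrt', [...]), ('mxnet', ['MyCustomOp']), ('tensorrt', [...]), ('mxnet', ['SoftmaxOutput'])]

Each 'tensorrt' segment would become a single TensorRT node wrapping an engine, while the 'mxnet' segments keep their original operators.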
Design considerations

1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet wiki here: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries

2. Since, as mentioned in #1, TensorRT graph building needs shape information that is only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired up on the Python side using ctypes (e.g. see https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521 for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Adapting the Python code to such changes wouldn't be enough either, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes can be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to enable; see the "APIs / user experience" section below). This does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph pass on and off depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.

3. TensorRT requires that the workspace size be provided at graph construction time. This value is the upper limit on the amount of memory that TensorRT may use; it does not determine immediate use. Since this amount can be hard for the user to know, the limit should be set to a reasonable value that the user need not concern themselves with. Given that the TensorRT integration is applied at bind time, and that TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter (either a method argument or an environment variable), or whether TensorRT should be allowed to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.

4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, or (b) using ONNX. Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to the subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR (see the sketch after this list). The reason for this is twofold. First, ONNX is a very well-known IR, supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible on whether the IR is feature-complete and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (https://github.com/onnx/onnx-tensorrt), and will continue to do so. Whatever changes may apply to the TensorRT APIs or internal features can be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.

5. Despite the advantages of the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem as long as it is not mandatory. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md) rather than living in a separate repository. Just as the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.

6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function similarly to other flags such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluating the TensorRT build and its usability, and to run unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.
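To make the ONNX route in point 4 concrete, here is a rough standalone sketch of the ONNX-to-TensorRT path using NVIDIA's onnx-tensorrt converter directly (based on that project's documented Python backend). It is only an illustration of the conversion the integration relies on; in the MxNet integration the extracted subgraph is converted internally, and the user never calls these APIs. The file name "model.onnx" and the input shape are hypothetical.

    import numpy as np
    import onnx                              # ONNX graphs are Protobuf messages, hence the Protobuf dependency
    import onnx_tensorrt.backend as backend

    model = onnx.load("model.onnx")                    # load the ONNX IR
    engine = backend.prepare(model, device="CUDA:0")   # build a TensorRT engine from the ONNX graph
    input_data = np.random.random((1, 3, 224, 224)).astype(np.float32)
    output = engine.run(input_data)[0]                 # run inference through TensorRT
    print(output.shape)

In the runtime integration, the equivalent conversion happens per extracted subgraph at bind time, and the resulting engine is wrapped in an MxNet TensorRT operator.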
APIs / user experience

There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we can simply do:

    os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward compatibility, if the environment variable is not set, it defaults to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. Binding typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check happens on every bind/re-bind to enable that (see the sketch at the end of this section). Since the TensorRT graph pass is enabled via an environment variable, no break in the C++, C, or any frontend language API is needed.

There is one more change required - in how simple bind is called. This doesn't change the simple bind API, only how it's called relative to the "usual" case, by using some of the optional arguments; specifically, the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:

1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it stores them inside the ICudaEngine object. The engine will then be serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call in order to construct the TensorRT engine.

3. The way to provide the weights is to hand them over to the simple bind call via the shared_buffer argument. The shared buffer weights can be provided during the bind call and can be freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or by calling del).

Since we need both arg_params (the weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:

    def merge_dicts(*dict_args):
        """Merge arg_params and aux_params to populate shared_buffer"""
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result

Now let's see a usage example:

    device = mx.gpu(0)
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
    executor = sym.simple_bind(ctx=device, data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)

Now we can simply update the data in the executor's arg dict and run the forward pass:

    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    predictions = executor.outputs[0].asnumpy()
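Putting the two pieces together, here is a hedged sketch (not part of the PR itself) of how the per-bind check of MXNET_USE_TENSORRT can be used to compare a TensorRT and a non-TensorRT run of the same symbol. It assumes sym, arg_params, aux_params, data_shape, batch_size, my_data_batch and merge_dicts are defined as in the examples above.

    import os
    import mxnet as mx

    def bind_and_run(use_tensorrt):
        # The graph pass reads the variable at bind time, so toggling it
        # between binds switches the TensorRT pass on and off.
        os.environ["MXNET_USE_TENSORRT"] = "1" if use_tensorrt else "0"
        executor = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                                   softmax_label=(batch_size,),
                                   shared_buffer=merge_dicts(arg_params, aux_params),
                                   grad_req='null', force_rebind=True)
        executor.arg_dict["data"][:] = my_data_batch
        executor.forward(is_train=False)
        return executor.outputs[0].asnumpy()

    baseline = bind_and_run(use_tensorrt=False)     # plain MxNet graph
    accelerated = bind_and_run(use_tensorrt=True)   # graph with TensorRT subgraphs
    print(abs(baseline - accelerated).max())        # sanity-check numerical agreement

This is also roughly the pattern the unit tests can follow, since the pass can be toggled at runtime without rebuilding anything (as long as MxNet was built with USE_TENSORRT).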
Limitations of initial integration and suggested future work

1. Since the new accelerator API proposal (https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries) was only published a few days ago and its implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent adopting that API in the near future.

2. Building the TensorRT engine takes a non-trivial amount of time, because the compiler evaluates performance and the hardware on the system before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, and larger models may take longer. TensorRT comes with the ability to serialize the engine for a particular hardware platform; this is called serializing a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide TensorRT plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem. Caching the TensorRT plan will be addressed in a future commit.

3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that provides an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (such as L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

4. The current commit supports many, but not all, TensorRT operators. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.
5. TensorRT supports plugins, which could be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch-space tensors can be provided by MxNet and reused by MxNet when not running the forward pass. The memory in permanent use will then be limited to TensorRT storing the weights. Support for this feature will be added in a future commit.
