I'll add it in about an hour.
On Jun 11, 2018, at 8:12 PM, Marco de Abreu <[email protected]> wrote:

I don't know how to grant permission on Confluence. If somebody else knows how to do so, please grant Marek the edit permissions.

-Marco

On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <[email protected]> wrote:

Hi Rajan,

I wanted to share on Confluence, but it didn't allow me to create a new document. If my e-mail address gets permissions to add new Confluence pages, I'll transfer the contents to Confluence. Please keep me posted when I get edit permissions.

Thanks!

Marek

On Mon, Jun 11, 2018 at 11:02 AM [email protected] <[email protected]> wrote:

Hi Marek,

Thanks for sharing the document. It would be great if you could share it on the Confluence wiki or in a Quip document. The formatting here makes a long document very difficult to read.

Appreciate the help.

Thanks
Rajan

On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]> wrote:

Hi everyone,

This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator for MxNet (a PR for discussion is coming in the next few days; the ETA for the first draft of the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here <https://developer.nvidia.com/tensorrt>.

2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort - either re-writing the model using TensorRT's graph-building APIs, or exporting the model to ONNX followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework but no longer does in a stand-alone TensorRT deployment with a client application.

3. TensorRT is very performant, but does not have the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't need to know which operators are supported by TensorRT and which ones aren't - runtime integration allows the graph partitioner to extract subgraphs capable of running inside TensorRT, place each subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle non-TensorRT-compatible nodes as regular MxNet operators remaining after the TensorRT subgraph extraction and node substitution (a toy illustration of this partitioning idea follows below). The goal is to accelerate inference without changing the user experience.
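To make point 3 concrete, here is a purely illustrative, linearized toy of the partitioning idea. It is not the actual NNVM graph pass (which operates on a DAG, not a flat list), and the set of supported operator names is hypothetical; it only shows how consecutive TensorRT-compatible operators get grouped into TensorRT subgraphs while everything else stays with the regular MxNet executor.

# Toy sketch only - the real pass walks the NNVM graph, not a flat list.
TRT_SUPPORTED = {"Convolution", "BatchNorm", "Activation", "Pooling", "FullyConnected"}

def partition(ops, supported=TRT_SUPPORTED):
    """Group maximal consecutive runs of supported ops into TensorRT segments."""
    segments = []  # list of ("tensorrt" | "mxnet", [op, ...]) pairs
    for op in ops:
        kind = "tensorrt" if op in supported else "mxnet"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(op)
        else:
            segments.append((kind, [op]))
    return segments

# The CNN body lands in a TensorRT segment; the unsupported custom op falls back to MxNet.
print(partition(["Convolution", "BatchNorm", "Activation", "MyCustomOp", "Pooling"]))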
Design considerations

1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all the shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet wiki here <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>.

2. Since, as mentioned in #1, TensorRT graph building needs shape information only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see here <https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521> for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Also, adapting the Python code to such changes wouldn't be enough, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes could be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to enable). This also does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph pass on and off depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.

3. TensorRT requires that the workspace size be provided at graph construction time. This value constitutes the upper limit on the amount of memory that TensorRT can use; it does not determine immediate use. Since this amount can be hard for the user to know, its limit should be set to a reasonable value that the user need not concern themselves with. Given that the TensorRT integration is applied at bind time, and that TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be able to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.

4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, or (b) using ONNX.
Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to the subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR. The reason for this is twofold. First, ONNX is a very well-known IR, supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether the IR is feature-complete and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (link <https://github.com/onnx/onnx-tensorrt>), and will continue to do so. Whatever changes may apply to the TensorRT APIs or its internal features can be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.

5. Despite the advantages of the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem if it is not a mandatory dependency. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (link <https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md>) rather than being located in a separate repository. Just as the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.

6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function similarly to other flags, such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluation of the TensorRT build and its usability, and to run unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.

APIs / user experience

There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we can simply do:

os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward compatibility, if the environment variable is not set, it defaults to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. This typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check happens during each bind/re-bind to enable that (a comparison sketch is given at the end of this section). Since the TensorRT graph pass is enabled using an environment variable, no break in the C++, C, or any frontend language API is needed.

Note that there is one more change required - in how simple bind is called. This doesn't change the simple bind API, but rather how it's called relative to the "usual" case, by using some of its optional arguments. This has to do with the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:

1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it will store them inside the ICudaEngine object. The engine will then be serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call to construct the TensorRT engine.

3. The way to provide the weights is to hand them over to the simple bind call via the shared_buffer argument. The shared buffer weights can be provided during the bind call and freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or by calling del).

Since we need both arg_params (weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:

def merge_dicts(*dict_args):
    """Merge arg_params and aux_params to populate shared_buffer"""
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

Now let's see a usage example:

device = mx.gpu(0)
sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
executor = sym.simple_bind(ctx=device, data=data_shape,
                           softmax_label=(batch_size,),
                           shared_buffer=merge_dicts(arg_params, aux_params),
                           grad_req='null', force_rebind=True)

Now we can simply update the data in the executor's arg dict and run the forward pass:

executor.arg_dict["data"][:] = my_data_batch
executor.forward(is_train=False)
predictions = executor.outputs[0].asnumpy()
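Since the environment variable is re-read on every bind, one can compare a TensorRT and a non-TensorRT run of the same model. The following is a minimal sketch of such a comparison, reusing the binding pattern and placeholder names (model_name, num_epochs, data_shape, batch_size, my_data_batch) from the example above; it is not part of the PR itself.

import os
import numpy as np
import mxnet as mx

def predict(use_tensorrt):
    # The flag is read at bind time, so toggling it and re-binding is enough.
    os.environ["MXNET_USE_TENSORRT"] = "1" if use_tensorrt else "0"
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
    executor = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)
    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    return executor.outputs[0].asnumpy()

# Loose tolerances, since TensorRT may fuse layers and run parts of the graph in fp16.
np.testing.assert_allclose(predict(True), predict(False), rtol=1e-2, atol=1e-3)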
Limitations of the initial integration and suggested future work

1. Since the new accelerator API proposal (link <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>) was only published a few days ago and its implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent making use of that API in the near future.

2. Building the TensorRT engine takes a non-trivial amount of time, because the TensorRT builder profiles the hardware and evaluates kernel performance before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, and larger models exist which may take longer. TensorRT comes with the ability to serialize the TensorRT engine for a particular hardware platform. This is called serialization of a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide TensorRT plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem (a rough way to measure the start-up cost is sketched after this list). Caching the TensorRT plan will be addressed in a future commit.
3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that provides an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (such as L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

4. The current commit supports many, but not all, of the operators that TensorRT can handle. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.

5. TensorRT supports plugins, which could be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch-space tensors can be provided by MxNet and re-used by MxNet when not running the forward pass. The memory in permanent use will then be limited to TensorRT storing the weights. Support for this feature will be added in a future commit.
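Regarding the start-up cost mentioned in point 2, here is a minimal sketch (again using the placeholder names from the earlier example, and assuming the MXNET_USE_TENSORRT variable described above) of how one might separate the one-time engine-build cost paid at bind time from steady-state inference latency:

import os
import time
import mxnet as mx

os.environ["MXNET_USE_TENSORRT"] = "1"
sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)

start = time.time()
executor = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                           softmax_label=(batch_size,),
                           shared_buffer=merge_dicts(arg_params, aux_params),
                           grad_req='null', force_rebind=True)
print("bind time (includes TensorRT engine build): %.1f s" % (time.time() - start))

executor.arg_dict["data"][:] = my_data_batch
executor.forward(is_train=False)        # warm-up run
executor.outputs[0].wait_to_read()

start = time.time()
num_batches = 100
for _ in range(num_batches):
    executor.forward(is_train=False)
    executor.outputs[0].wait_to_read()  # synchronize so the timing is meaningful
print("steady-state latency: %.2f ms/batch" % ((time.time() - start) * 1000 / num_batches))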
