Re: Details regarding upcoming PR for runtime TensorRT integration

Naveen Swamy Mon, 11 Jun 2018 12:55:19 -0700

you have access now.

On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <[email protected]> wrote:


> I'll add in about an hour
>
> > On Jun 11, 2018, at 8:12 PM, Marco de Abreu <
> [email protected]> wrote:
> >
> > I don't know how to grant permission on Confluence. If somebody else
> knows
> > how to do so, please grant Marek the edit permissions.
> >
> > -Marco
> >
> >> On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <[email protected]>
> wrote:
> >>
> >> Hi Rajan,
> >>
> >> I wanted to share on Confluence, but it didn't allow me to create a new
> >> document. If my e-mail address gets permissions to add new Confluence
> >> pages, I'll transfer the contents to Confluence. Please keep me posted
> when
> >> I get edit permissions.
> >>
> >> Thanks!
> >>
> >> Marek
> >>
> >>
> >>
> >> On Mon, Jun 11, 2018 at 11:02 AM [email protected] <
> >> [email protected]> wrote:
> >>
> >>> Hi Marek,
> >>>
> >>> Thanks for sharing the document. It would be great if you could share it
> >>> on the Confluence wiki or in a Quip document. The formatting here makes it
> >>> very difficult to read a long document.
> >>>
> >>> Appreciate the help.
> >>>
> >>> Thanks
> >>> Rajan
> >>>
> >>>> On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]> wrote:
> >>>> *Hi everyone,
> >>>>
> >>>> This is a quick summary of NVIDIA’s plans for open-sourcing an initial
> >>>> integration of TensorRT as a runtime accelerator of MxNet (PR for
> >>>> discussion coming in the next few days, ETA of the first draft of the PR
> >>>> is this Friday or even earlier). Feedback is appreciated.
> >>>>
> >>>> Best,
> >>>> Marek Kolodziej
> >>>>
> >>>> Need for runtime MxNet-TensorRT integration
> >>>>
> >>>> 1. TensorRT provides
> >>>> significant acceleration of model inference on NVIDIA GPUs compared to
> >>>> running the full graph in MxNet using unfused GPU operators. In
> >> addition
> >>> to
> >>>> faster fp32 inference, TensorRT optimizes fp16 inference, and is
> >> capable
> >>> of
> >>>> int8 inference (provided the quantization steps are performed).
> Besides
> >>>> increasing throughput, TensorRT significantly reduces inference
> >> latency,
> >>>> especially for small batches. See more here
> >>>> <https://developer.nvidia.com/tensorrt>.
> >>>>
> >>>> 2. Despite its benefits, using
> >>>> pre-trained models with TensorRT typically requires some effort -
> >> either
> >>>> re-writing the model using TensorRT’s graph building APIs, or
> >> exporting a
> >>>> model to ONNX, followed by an import step. Even if the import is
> >>> simplified
> >>>> using ONNX, the TensorRT user still needs to provide their own data
> >>>> pipeline, which used to exist in the framework, but no longer does in
> a
> >>>> stand-alone TensorRT deployment with a client application.
> >>>>
> >>>> 3. TensorRT is
> >>>> very performant, but does not have the full set of MxNet’s operators.
> >>> While
> >>>> that could be addressed with TensorRT plugins, it’s much simpler to
> >> reuse
> >>>> already-existing MxNet operators. Also, the user shouldn’t need to
> >>>> know which operators are supported by TensorRT and which ones aren’t -
> >>>> runtime integration allows the graph partitioner to extract subgraphs
> >>>> capable of running inside of TensorRT, place the subgraph in a
> TensorRT
> >>>> operator in MxNet, execute that operator as part of MxNet’s graph
> >>>> execution, and handle non-TensorRT-compatible nodes as regular MxNet
> >>>> operators remaining after the TensorRT subgraph extraction and node
> >>>> substitution. The goal is to accelerate inference without changing user
> >>>> experience.
> >>>>
> >>>> Design considerations
> >>>>
> >>>> 1. Since TensorRT can only determine all
> >>>> possible optimizations once the tensor shapes are known, it is
> >> imperative
> >>>> that all the shape information be provided. This means that the best
> >> time
> >>>> to construct the TensorRT graph is bind time. The coming PR can
> >>> selectively
> >>>> apply the TensorRT optimization for inference-only graphs at symbol
> >> bind
> >>>> time. This is in fact consistent with the assumptions about TensorRT
> >> made
> >>>> on the MxNet Wiki here
> >>>> <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>.
> >>>>
> >>>> 2. Since as mentioned in #1, TensorRT graph building needs shape
> >>>> information only available at bind time, an important goal was not to
> >>>> disrupt any existing APIs. Even though C++ permits default function
> >>>> arguments, the Python bindings for symbol-related methods (e.g. simple
> >>>> bind) are exposed via a C, not C++, API, wired on the Python side
> >>>> using ctypes (e.g. see here
> >>>> <https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521>
> >>>> for the simple bind integration). This precludes the addition of extra
> >>>> arguments without causing breaking changes in the C API. Also,
> adapting
> >>> the
> >>>> Python code to such changes wouldn’t be enough, since all frontend
> >>>> languages use the C (not C++) API for the FFI. Fortunately, C API
> >> changes
> >>>> could be avoided by simply letting the user enable or disable the
> >>>> TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to
> >>>> enable). This also
> >>>> does not diminish the flexibility of the integration, since the graph
> >>> pass
> >>>> can read the environment variable each time symbol binding is done,
> and
> >>>> hence permits turning the graph passes on and off, depending on need.
> >> The
> >>>> ability to enable and disable the TensorRT pass at runtime also makes
> >>>> unit testing easier.
> >>>>
> >>>> 3. TensorRT requires that the workspace size be provided at
> >>>> graph construction time. This value constitutes the upper limit on the
> >>>> amount of memory that TensorRT can use, and does not determine
> >> immediate
> >>>> use. Since this amount can be hard for the user to know, its limit
> >> should
> >>>> be set to a reasonable value that the user need not concern themselves
> >>>> with. Given that TensorRT integration is applied at bind time and that
> >>>> TensorRT engines wrapped in TensorRT nodes are constructed during the
> >>> graph
> >>>> pass rather than the memory allocation pass, MxNet will only allocate
> >>> the
> >>>> amount needed for the nodes remaining after the TensorRT subgraphs
> have
> >>>> been extracted. This means that no memory will be doubly allocated -
> >>> first
> >>>> for the complete MxNet subgraph and then for TensorRT. However, the
> >>>> question remains whether the memory used per TensorRT engine should be
> >> a
> >>>> configurable parameter, either as a method argument or an environment
> >>>> variable, or whether TensorRT should be able to use the maximum
> >> available
> >>>> GPU memory and then reserve only what it needs. I would like to
> suggest
> >>> the
> >>>> latter. Since the TensorRT subgraph will typically use less memory
> than
> >>> the
> >>>> same subgraph in MxNet (due to more layer fusion), it’s extremely
> >>> unlikely
> >>>> that a model which runs purely as an MxNet graph would fail with an
> >>>> out-of-memory error when parts or most of the graph run inside TensorRT.
> >>>> Fewer knobs (in this case, not giving the user the ability to tweak the
> >>>> maximum amount of memory available to TensorRT) would simplify use.
> >>>>
> >>>> 4. TensorRT can
> >>>> accept graphs constructed using two main approaches: (a) via the
> >> TensorRT
> >>>> graph API, (b) using ONNX. Approach (a) seems simple on the surface -
> >> one
> >>>> traverses the NNVM graph, finds subgraphs that TensorRT can execute,
> >>>> converts the subgraphs to TensorRT graphs, and substitutes the
> >> subgraphs
> >>>> with TensorRT nodes, each of which contains the TensorRT engine
> >>>> corresponding to the subgraph. However, the approach taken by NVIDIA
> >>>> was to use ONNX as the IR. The reason for this is twofold. First, ONNX is a
> >> very
> >>>> well-known IR, which is supported by the entire deep learning software
> >>>> community. This ensures that the design of the IR gets as much
> feedback
> >>> as
> >>>> possible as to whether the IR is feature complete, and what the
> >> semantics
> >>>> are. NVIDIA already maintains an ONNX-to-TensorRT converter (link
> >>>> <https://github.com/onnx/onnx-tensorrt>), and will continue to do so.
> >>>> Any changes to the TensorRT APIs or its internal features can then be
> >>>> hidden behind the well-established ONNX IR.
> >>> Second,
> >>>> ONNX is growing beyond being merely an IR. As it becomes more of a
> >>>> standard, its adoption will be associated with other benefits, such as
> >>>> the ability to verify standard compliance.
> >>>>
> >>>> 5. Despite the advantages of using
> >>>> the ONNX route described in #4, there are some costs. The main one is
> >>> the
> >>>> dependency on Protobuf. This is a valid criticism on the surface.
> >>>> However, since the TensorRT integration requires an opt-in at build
> >>>> time, adding one more dependency is not a problem as long as it is not
> >>>> mandatory.
> >>>> Moreover, the same Protobuf dependency already exists for the MxNet
> >> ONNX
> >>>> importer, which is now part of the MxNet source tree (link
> >>>> <https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md>),
> >>>> rather than being located in a separate repository. Just like the use
> >> of
> >>>> the ONNX importer is optional and requires ONNX (and hence also
> >>> Protobuf),
> >>>> the TensorRT build is optional.
> >>>>
> >>>> 6. The optional integration of TensorRT will be guarded using a
> >>>> config.mk flag (USE_TENSORRT),
> >>>> which will function similarly to other flags, such as USE_CUDA,
> >>> USE_CUDNN,
> >>>> etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.
> >>>>
> >>>> 7. In order to simplify evaluating the TensorRT build and its usability,
> >>>> and to run
> >>>> unit tests, the PR will come with a Dockerfile, which will allow
> anyone
> >>> to
> >>>> build MxNet with TensorRT, along with its dependencies, i.e. Protobuf
> >>>> and ONNX.
> >>>>
> >>>> APIs / user experience
> >>>>
> >>>> There is no change in the inference APIs, except for the need to set the
> >>>> MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we
> >>>> can simply do:
> >>>>
> >>>>     import os
> >>>>     os.environ["MXNET_USE_TENSORRT"] = "1"
> >>>>
> >>>> Note that for backward
> >>>> compatibility, if the environment variable is not set, it will default
> >> to
> >>>> 0. Also, unlike some other environment variables that are only checked
> >>>> during MxNet initialization, this one gets checked every time graph
> >>> binding
> >>>> happens. This typically happens only once during the inference
> >>>> application’s life cycle, but since one can re-bind a symbol to, say,
> >>>> compare
> >>>> a TensorRT and a non-TensorRT run, the check will happen during each
> >>>> bind/re-bind to enable that. Since the TensorRT graph pass is enabled
> >>> using
> >>>> an environment variable, no break in the C++, C or any frontend
> >> language
> >>>> API is needed. Note that there is one more change required - in
> calling
> >>>> simple bind. This doesn’t change the simple bind API, but how it’s
> >> called
> >>>> relative to the “usual” case, by using some of the arguments which are
> >>>> optional. This has to do with the shared_buffer parameter. Before
> >>>> explaining how the call changes, let’s consider why it’s necessary:
> >>>>
> >>>> 1. The
> >>>> TensorRT graph needs to be constructed during the simple bind call,
> but
> >>>> before memory gets allocated for the non-TensorRT part of the graph.
> >>>>
> >>>> 2.
> >>>> TensorRT needs the weights, not just the shapes, to be provided before
> >>> the
> >>>> engine is constructed - it will store them inside the ICudaEngine
> >> object.
> >>>> The engine will then be serialized inside the NNVM TensorRT op, and
> >>>> deserialized when the graph executor takes over. This means that the
> >>>> weights need to be provided to the simple bind call to construct the
> >>>> TensorRT engine.
> >>>>
> >>>> 3. The way to provide the weights is to hand them over to
> >>>> the simple bind call via the “shared buffer” argument. The shared
> >> buffer
> >>>> weights can be provided during the bind call and can be freed by the
> >>>> frontend language once binding is complete (e.g. by exiting the
> >> relevant
> >>>> scope in Python, or calling del).
> >>>>
> >>>> Since we need both arg_params (weights) and aux_params (e.g. BatchNorm
> >>>> moments), we need to merge arg_params and aux_params into one dictionary.
> >>>> Here’s a Python example:
> >>>>
> >>>>     def merge_dicts(*dict_args):
> >>>>         """Merge arg_params and aux_params to populate shared_buffer"""
> >>>>         result = {}
> >>>>         for dictionary in dict_args:
> >>>>             result.update(dictionary)
> >>>>         return result
> >>>>
> >>>> Now let’s see a use example:
> >>>>
> >>>>     import mxnet as mx
> >>>>
> >>>>     device = mx.gpu(0)
> >>>>     sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
> >>>>     executor = sym.simple_bind(ctx=device,
> >>>>         data=data_shape,
> >>>>         softmax_label=(batch_size,),
> >>>>         shared_buffer=merge_dicts(arg_params, aux_params),
> >>>>         grad_req='null',
> >>>>         force_rebind=True)
> >>>>
> >>>> Now we can simply update data in the executor’s arg dict and run the
> >>>> forward pass:
> >>>>
> >>>>     executor.arg_dict["data"][:] = my_data_batch
> >>>>     executor.forward(is_train=False)
> >>>>     predictions = executor.outputs[0].asnumpy()
> >>>>
> >>>> Limitations of initial integration and suggested future work
> >>>>
> >>>> 1. Since the new accelerator API proposal (link
> >>>> <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>)
> >>>> was only published a few days ago and the implementation is still on
> an
> >>>> MxNet fork, the current TensorRT integration doesn’t use that API yet,
> >>> but
> >>>> could be refactored in a future commit to use it. There is nothing in
> >> the
> >>>> current design that would prevent making use of that API in the near
> >>>> future.
> >>>>
> >>>> 2. Building the TensorRT engine takes a non-trivial amount of time,
> >>>> because the compiler evaluates performance and the hardware on the
> >> system
> >>>> before creating the fused layers on demand, and then needs to actually
> >>>> compile them. For ResNet-50 this may be a few seconds, but larger
> >> models
> >>>> also exist which may take longer. TensorRT comes with the ability to
> >>>> serialize the TensorRT engine for a particular hardware platform. This
> >> is
> >>>> called the serialization of a TensorRT plan, which is the engine along
> >>> with
> >>>> the ahead-of-time-compiled fused kernels for a given GPU. The first PR
> >> of
> >>>> the TensorRT integration will not provide for TensorRT plan caching,
> so
> >>>> using TensorRT might have a small start-up cost, but for long-running
> >>>> inference processes, this shouldn’t be a problem. Caching the TensorRT
> >>> plan
> >>>> will be addressed in a future commit.
> >>>>
> >>>> 3. As mentioned before, the
> >>>> reproducibility of the build will be demonstrated using a Dockerfile
> >>> that
> >>>> will provide an easy way to evaluate the build. The Docker recipe was
> >>>> tested on Linux on x86_64, but not other platforms supported by
> >> TensorRT
> >>>> (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64).
> >>>> Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T, i.e.
> Linux
> >>> for
> >>>> Tegra, on the NVIDIA Jetson platform) is left for subsequent commits.
> >>>>
> >>>> 4. The current commit supports many, but not all, of TensorRT’s operators.
> >> For
> >>>> example, this integration can run CNNs such as VGG, or ResNet, but not
> >>>> necessarily everything that TensorRT can support. More operators will
> >> be
> >>>> covered in future commits.
> >>>>
> >>>> 5. TensorRT supports plugins, which can be
> >>>> integrated into the graph pass. However, this was not a priority since
> >>> the
> >>>> runtime TensorRT integration can always fall back to existing MxNet
> >>>> operators. Supporting plugins is possible, but will be added in future
> >>>> commits.
> >>>>
> >>>> 6. The upcoming PR will support fp16 and fp32, but not int8. Since
> >>>> int8 support in MxNet is itself very new, figuring out calibration and
> >>>> other details is left for a future commit.
> >>>>
> >>>> 7. TensorRT 4 is going to have a
> >>>> new feature called BYOM (bring your own memory). This means that
> >> instead
> >>> of
> >>>> telling TensorRT how much memory it can use, the data/scratch space
> >>> tensors
> >>>> can be provided by MxNet, and can be re-used by MxNet when not running
> >>> the
> >>>> forward pass. The memory in permanent use will then be limited to
> >>> TensorRT
> >>>> storing weights. Support for this feature will be added in a future
> >>> commit.*
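
The subgraph extraction described in the quoted proposal (find the parts of the graph TensorRT can run, wrap each in a single TensorRT node, leave the rest as regular MxNet operators) can be illustrated with a toy Python sketch. This is not the partitioner from the PR: it assumes a simple linear chain of operators rather than a full NNVM graph, and the names `TRT_SUPPORTED` and `partition_chain` are hypothetical, invented only for this illustration.

```python
from typing import List, Set, Union

# Hypothetical set of operator names the accelerator can run (illustrative only).
TRT_SUPPORTED: Set[str] = {
    "Convolution", "BatchNorm", "Activation", "Pooling", "FullyConnected",
}


def partition_chain(ops: List[str], supported: Set[str]) -> List[Union[str, List[str]]]:
    """Group maximal runs of supported ops in a linear chain.

    Each run becomes one list (standing in for a single TensorRT node);
    unsupported ops pass through unchanged as plain strings.
    """
    result: List[Union[str, List[str]]] = []
    run: List[str] = []
    for op in ops:
        if op in supported:
            run.append(op)          # extend the current accelerated run
        else:
            if run:                 # close the run before the fallback op
                result.append(run)
                run = []
            result.append(op)       # unsupported op stays a regular node
    if run:
        result.append(run)
    return result


chain = ["Convolution", "BatchNorm", "Activation",
         "CustomOp", "FullyConnected", "SoftmaxOutput"]
partitioned = partition_chain(chain, TRT_SUPPORTED)
print(partitioned)
```

Each inner list stands for one fused TensorRT node, and the bare strings are operators that fall back to the framework. A real graph pass would also have to handle branching graphs and shape information, which this sketch ignores.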
> >>>>
> >>>
> >>
>
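
The per-bind environment check described in the quoted proposal (the variable is read at every bind, and an unset variable means 0) can be sketched in plain Python. This is only a model of the behaviour, not MXNet code: `tensorrt_requested` and `bind` are hypothetical stand-ins, and treating exactly the string "1" as "enabled" is an assumption made for this example.

```python
import os

ENV_VAR = "MXNET_USE_TENSORRT"  # variable name taken from the proposal


def tensorrt_requested() -> bool:
    """Read the flag at call time; an unset variable is treated as 0."""
    # Assumption: only the exact string "1" enables the pass.
    return os.environ.get(ENV_VAR, "0") == "1"


def bind(symbol_name: str) -> str:
    """Hypothetical stand-in for a bind call; reports which path was chosen."""
    path = "tensorrt" if tensorrt_requested() else "mxnet"
    return f"{symbol_name}:{path}"


# Unset: defaults to the regular path, preserving backward compatibility.
os.environ.pop(ENV_VAR, None)
baseline = bind("resnet50")

# Enable the pass, then re-bind: the new value is picked up immediately.
os.environ[ENV_VAR] = "1"
accelerated = bind("resnet50")

# Disable again for a comparison run in the same process.
os.environ[ENV_VAR] = "0"
fallback = bind("resnet50")

print(baseline, accelerated, fallback)
```

Because the flag is read inside `bind` rather than once at import time, the same process can compare an accelerated and a non-accelerated run simply by changing the variable between binds.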
