+1 for reviewing a design doc. Naveen - why do you see it sitting under ONNX? Isn't it the broader topic of GPU acceleration?
Hagay

On Mon, Jun 11, 2018, 12:56 Naveen Swamy <[email protected]> wrote:

please add your proposal under design proposals; once the community has reviewed it and there is consensus on the approach, we can create an ONNX-MXNet sub-section and move it there.

On Mon, Jun 11, 2018 at 9:54 PM, Naveen Swamy <[email protected]> wrote:

you have access now.

On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <[email protected]> wrote:

I'll add in about an hour.

On Jun 11, 2018, at 8:12 PM, Marco de Abreu <[email protected]> wrote:

I don't know how to grant permission on Confluence. If somebody else knows how to do so, please grant Marek the edit permissions.

-Marco

On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <[email protected]> wrote:

Hi Rajan,

I wanted to share on Confluence, but it didn't allow me to create a new document. If my e-mail address gets permissions to add new Confluence pages, I'll transfer the contents to Confluence. Please keep me posted when I get edit permissions.

Thanks!

Marek

On Mon, Jun 11, 2018 at 11:02 AM [email protected] <[email protected]> wrote:

Hi Marek,

Thanks for sharing the document. It would be great if you could share it on the Confluence wiki or a Quip document. The formatting here makes a long document very difficult to read.

Appreciate the help.

Thanks,
Rajan

On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]> wrote:

Hi everyone,

This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator for MxNet (a PR for discussion is coming in the next few days; the ETA for the first draft of the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference, and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here: https://developer.nvidia.com/tensorrt

2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort - either re-writing the model using TensorRT's graph-building APIs, or exporting the model to ONNX, followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework but no longer does in a stand-alone TensorRT deployment with a client application.

3. TensorRT is very performant, but does not have the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't need to know which operators are supported by TensorRT and which ones aren't - runtime integration allows the graph partitioner to extract subgraphs capable of running inside TensorRT, place each subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle non-TensorRT-compatible nodes as regular MxNet operators remaining after the TensorRT subgraph extraction and node substitution (a toy sketch of this partitioning idea follows this list). The goal is to accelerate inference without changing the user experience.
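To make the partitioning idea in point 3 concrete, here is a minimal, purely illustrative Python sketch. It is not the actual NNVM graph pass: it treats the network as a flat chain of operator names, and the operator names and supported-op set are made up for illustration. Maximal runs of compatible ops stand in for the subgraphs that would be replaced by a single TensorRT node.

    # Illustrative only: a toy partitioner over a linear chain of ops.
    # The op names and the supported-op set below are assumptions, not the
    # actual whitelist used by the integration.
    TRT_SUPPORTED = {"Convolution", "BatchNorm", "Activation",
                     "Pooling", "FullyConnected"}

    def partition(ops):
        """Group maximal runs of TensorRT-compatible ops into segments.

        Each segment is ("tensorrt", [ops]) for a run that would become a
        single TensorRT node, or ("mxnet", [op]) for an operator that stays
        on the regular MxNet execution path.
        """
        segments, run = [], []
        for op in ops:
            if op in TRT_SUPPORTED:
                run.append(op)
            else:
                if run:
                    segments.append(("tensorrt", run))
                    run = []
                segments.append(("mxnet", [op]))
        if run:
            segments.append(("tensorrt", run))
        return segments

    print(partition(["Convolution", "BatchNorm", "Activation",
                     "CustomOp", "FullyConnected", "SoftmaxOutput"]))
    # -> [('tensorrt', ['Convolution', 'BatchNorm', 'Activation']),
    #     ('mxnet', ['CustomOp']),
    #     ('tensorrt', ['FullyConnected']),
    #     ('mxnet', ['SoftmaxOutput'])]

In the real integration the partitioning operates on the NNVM graph rather than a flat list, but the user-facing effect is the same: compatible regions run inside TensorRT, and everything else falls back to regular MxNet operators.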
Design considerations

1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all the shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet wiki here: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries

2. Since, as mentioned in #1, TensorRT graph building needs shape information only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521 for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Also, adapting the Python code to such changes wouldn't be enough, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes can be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to enable). This does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph pass on and off depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.

3. TensorRT requires that the workspace size be provided at graph construction time. This value constitutes an upper limit on the amount of memory that TensorRT can use, and does not determine immediate use. Since this amount can be hard for the user to know, the limit should be set to a reasonable value that the user need not concern themselves with. Given that TensorRT integration is applied at bind time, and that the TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be able to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.
4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, or (b) using ONNX. Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to its subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR. The reason for this is twofold. First, ONNX is a very well-known IR, supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether the IR is feature-complete and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (https://github.com/onnx/onnx-tensorrt), and will continue to do so (a stand-alone example of this ONNX-to-TensorRT path is sketched after this list). Whatever changes may apply to the TensorRT APIs or internal features can be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.

5. Despite the advantages of using the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem if it is not a mandatory dependency. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md) rather than being located in a separate repository. Just like the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.
6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function similarly to other flags such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluating the TensorRT build, checking usability, and running unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.
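As a point of reference for the ONNX route mentioned in design consideration #4, the onnx-tensorrt converter can also be driven directly from Python. The sketch below follows the usage pattern documented in the onnx-tensorrt repository; the model path, device string, and input shape are placeholders, and this stand-alone flow is separate from the in-framework integration proposed here (where the NNVM-to-ONNX-to-TensorRT conversion happens at bind time).

    # Stand-alone ONNX -> TensorRT flow via NVIDIA's onnx-tensorrt converter.
    # Placeholder model path and input shape; shown only to illustrate the
    # ONNX route that the proposed integration uses internally.
    import numpy as np
    import onnx
    import onnx_tensorrt.backend as backend

    model = onnx.load("/path/to/model.onnx")        # placeholder path
    engine = backend.prepare(model, device="CUDA:0")
    input_data = np.random.random(size=(1, 3, 224, 224)).astype(np.float32)
    output = engine.run(input_data)[0]
    print(output.shape)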
APIs / user experience

There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python we can simply do:

    os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward compatibility, if the environment variable is not set, it defaults to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. Binding typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check happens during each bind/re-bind to enable that. Since the TensorRT graph pass is enabled using an environment variable, no break in the C++, C, or any frontend language API is needed.

Note that there is one more change required - in calling simple bind. This doesn't change the simple bind API, but it does change how it's called relative to the "usual" case, by using some of the optional arguments. This has to do with the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:

1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it will store them inside the ICudaEngine object. The engine will then be serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call in order to construct the TensorRT engine.

3. The way to provide the weights is to hand them over to the simple bind call via the shared_buffer argument. The shared buffer weights can be provided during the bind call and can be freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or calling del).

Since we need both arg_params (weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:

    def merge_dicts(*dict_args):
        """Merge arg_params and aux_params to populate shared_buffer"""
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result

Now let's see a usage example:

    device = mx.gpu(0)
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
    executor = sym.simple_bind(ctx=device, data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)

Now we can simply update the data in the executor's arg dict and run the forward pass:

    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    predictions = executor.outputs[0].asnumpy()
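Building on the example above, a symbol can be re-bound with the environment variable toggled in order to compare the TensorRT and non-TensorRT paths on the same inputs. This is a hypothetical sketch assuming the proposed MXNET_USE_TENSORRT semantics (read at every bind/re-bind); it reuses merge_dicts and the checkpoint variables from the example above.

    import os
    import numpy as np
    import mxnet as mx

    def bind_and_predict(sym, arg_params, aux_params, data_batch, use_trt):
        # Toggle the proposed graph pass; the flag is read at bind time.
        os.environ["MXNET_USE_TENSORRT"] = "1" if use_trt else "0"
        executor = sym.simple_bind(ctx=mx.gpu(0), data=data_batch.shape,
                                   softmax_label=(data_batch.shape[0],),
                                   shared_buffer=merge_dicts(arg_params, aux_params),
                                   grad_req='null', force_rebind=True)
        executor.arg_dict["data"][:] = data_batch
        executor.forward(is_train=False)
        return executor.outputs[0].asnumpy()

    # trt_out = bind_and_predict(sym, arg_params, aux_params, my_data_batch, True)
    # ref_out = bind_and_predict(sym, arg_params, aux_params, my_data_batch, False)
    # np.testing.assert_allclose(trt_out, ref_out, rtol=1e-2, atol=1e-2)

The loose tolerances in the last line reflect that fused TensorRT kernels are not expected to be bit-exact with the unfused MxNet operators.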
Limitations of initial integration and suggested future work

1. Since the new accelerator API proposal (https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries) was only published a few days ago and the implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent making use of that API in the near future.

2. Building the TensorRT engine takes a non-trivial amount of time, because the compiler evaluates performance and the hardware on the system before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, but larger models exist that may take longer. TensorRT comes with the ability to serialize the TensorRT engine for a particular hardware platform. This is called serialization of a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide TensorRT plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem. Caching the TensorRT plan will be addressed in a future commit.

3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that will provide an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

4. The current commit supports many, but not all, of TensorRT's operators. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.

5. TensorRT supports plugins, which could be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch-space tensors can be provided by MxNet and re-used by MxNet when not running the forward pass. The memory in permanent use will then be limited to TensorRT storing the weights. Support for this feature will be added in a future commit.
