Hello Marek,

Thank you for your detailed design doc. My understanding is that the current implementation converts an NNVM graph to an ONNX graph and loads the ONNX graph into TensorRT. What is unclear to me is how an operator unsupported by TensorRT is handled in this strategy. It seems you fall back to the MXNet operators. Does your current solution partition the graph and load subgraphs into TensorRT? If so, why do you need to convert a partitioned subgraph to ONNX first? If you instead convert the entire NNVM graph to ONNX, could you describe in more detail how you fall back to MXNet operators?
Thanks,
Da

On Mon, Jun 11, 2018 at 6:29 PM, Hagay Lupesko <[email protected]> wrote:

+1 for reviewing a design doc.

Naveen - why do you see it sitting under ONNX? Isn't it a broader topic of GPU acceleration?

Hagay

On Mon, Jun 11, 2018, 12:56 Naveen Swamy <[email protected]> wrote:

Please add your proposal under design proposals. Once the community has reviewed it and there is consensus on the approach, we can create an ONNX-MXNet sub-section and move it there.

On Mon, Jun 11, 2018 at 9:54 PM, Naveen Swamy <[email protected]> wrote:

You have access now.

On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <[email protected]> wrote:

I'll add you in about an hour.

On Jun 11, 2018, at 8:12 PM, Marco de Abreu <[email protected]> wrote:

I don't know how to grant permission on Confluence. If somebody else knows how to do so, please grant Marek the edit permissions.

-Marco

On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <[email protected]> wrote:

Hi Rajan,

I wanted to share on Confluence, but it didn't allow me to create a new document. If my e-mail address gets permission to add new Confluence pages, I'll transfer the contents to Confluence. Please keep me posted when I get edit permissions.

Thanks!

Marek

On Mon, Jun 11, 2018 at 11:02 AM [email protected] <[email protected]> wrote:

Hi Marek,

Thanks for sharing the document. It would be great if you could share it on the Confluence wiki or a Quip document. The formatting here makes it very difficult to read a long document.

Appreciate the help.

Thanks
Rajan

On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]> wrote:

Hi everyone,

This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator for MxNet (a PR for discussion is coming in the next few days; the ETA for the first draft of the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here: https://developer.nvidia.com/tensorrt

2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort - either re-writing the model using TensorRT's graph-building APIs, or exporting the model to ONNX, followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework but no longer does in a stand-alone TensorRT deployment with a client application.
3. TensorRT is very performant, but does not support the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't need to know which operators are supported by TensorRT and which ones aren't - runtime integration allows the graph partitioner to extract subgraphs capable of running inside TensorRT, place each subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle non-TensorRT-compatible nodes as the regular MxNet operators remaining after subgraph extraction and node substitution (see the toy sketch after this list). The goal is to accelerate inference without changing the user experience.
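For intuition only, here is a self-contained toy sketch of the partition-and-fall-back idea from point 3. It uses a plain Python list of op names instead of an NNVM graph, and the "compatible" set is made up for illustration; it is not the actual graph pass from the PR.

    # Toy illustration of subgraph extraction with MxNet fallback.
    # Op names and the "compatible" set are hypothetical; the real
    # implementation operates on NNVM graphs, not lists of strings.
    TRT_COMPATIBLE = {"Convolution", "BatchNorm", "Activation",
                      "Pooling", "FullyConnected"}

    def partition(ops, compatible=TRT_COMPATIBLE):
        """Split a linear chain of ops into alternating TensorRT/MxNet segments."""
        segments = []
        for op in ops:
            target = "TensorRT" if op in compatible else "MxNet"
            if segments and segments[-1][0] == target:
                segments[-1][1].append(op)
            else:
                segments.append((target, [op]))
        return segments

    print(partition(["Convolution", "BatchNorm", "Activation",
                     "MyCustomOp", "FullyConnected", "SoftmaxOutput"]))
    # [('TensorRT', ['Convolution', 'BatchNorm', 'Activation']),
    #  ('MxNet', ['MyCustomOp']),
    #  ('TensorRT', ['FullyConnected']),
    #  ('MxNet', ['SoftmaxOutput'])]

Each "TensorRT" segment would become a single TensorRT node wrapping the corresponding engine, while "MxNet" segments keep running as ordinary MxNet operators.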
Design considerations

1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet wiki here: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries

2. Since, as mentioned in #1, TensorRT graph building needs shape information that is only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521 for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Also, adapting the Python code to such changes wouldn't be enough, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes could be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to enable). This also does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph pass on and off depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.

3. TensorRT requires that the workspace size be provided at graph construction time. This value constitutes the upper limit on the amount of memory that TensorRT can use, and does not determine immediate use. Since this amount can be hard for the user to know, its limit should be set to a reasonable value that the user need not concern themselves with. Given that TensorRT integration is applied at bind time, and that TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be able to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.

4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, or (b) using ONNX. Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to the subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR, for two reasons. First, ONNX is a very well-known IR, which is supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether the IR is feature-complete and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (https://github.com/onnx/onnx-tensorrt), and will continue to do so (see the example after this list). Whatever changes may apply to the TensorRT APIs or internal features can be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.
5. Despite the advantages of the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem if it is not a mandatory dependency. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md) rather than being located in a separate repository. Just as the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.

6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function similarly to other flags such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluation of the TensorRT build and its usability, and to run unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.
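As a point of reference for the ONNX route in #4, the standalone ONNX-to-TensorRT converter exposes a Python backend that can be used roughly as follows. The model path and input shape below are placeholders, and the exact calls follow that project's README at the time of writing, so treat them as an assumption; the MxNet integration performs the equivalent conversion internally at bind time, so users would not call this directly.

    # Standalone onnx-tensorrt usage (illustrative; not part of the MxNet PR).
    import numpy as np
    import onnx
    import onnx_tensorrt.backend as backend

    model = onnx.load("/path/to/model.onnx")          # placeholder path
    engine = backend.prepare(model, device="CUDA:0")  # builds a TensorRT engine
    input_data = np.random.random(size=(1, 3, 224, 224)).astype(np.float32)
    output = engine.run(input_data)[0]
    print(output.shape)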
APIs / user experience

There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we can simply do:

    os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward compatibility, if the environment variable is not set, it defaults to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. This typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check happens during each bind/re-bind to enable that. Since the TensorRT graph pass is enabled using an environment variable, no break in the C++, C or any frontend language API is needed.

Note that there is one more change required - in calling simple bind. This doesn't change the simple bind API, but rather how it's called relative to the "usual" case, by using some of the optional arguments. This has to do with the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:

1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it will store them inside the ICudaEngine object. The engine is then serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call so that the TensorRT engine can be constructed.

3. The way to provide the weights is to hand them over to the simple bind call via the shared_buffer argument. The shared buffer weights can be provided during the bind call and can be freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or calling del).

Since we need both arg_params (weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:

    def merge_dicts(*dict_args):
        """Merge arg_params and aux_params to populate shared_buffer"""
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result

Now let's see a usage example:

    device = mx.gpu(0)
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
    executor = sym.simple_bind(ctx=device, data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)

Now we can simply update the data in the executor's arg dict and run the forward pass:

    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    predictions = executor.outputs[0].asnumpy()
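Since the environment variable is re-read on every bind, a TensorRT run and a plain MxNet run can be compared side by side in a single process. Here is a minimal sketch, assuming the per-bind behaviour described above and reusing sym, arg_params, aux_params, data_shape, batch_size, my_data_batch and merge_dicts from the example:

    import os
    import numpy as np

    shared = merge_dicts(arg_params, aux_params)

    def bind_and_predict(use_tensorrt):
        # The TensorRT graph pass is toggled per bind via the environment variable.
        os.environ["MXNET_USE_TENSORRT"] = "1" if use_tensorrt else "0"
        exe = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                              softmax_label=(batch_size,),
                              shared_buffer=shared,
                              grad_req='null', force_rebind=True)
        exe.arg_dict["data"][:] = my_data_batch
        exe.forward(is_train=False)
        return exe.outputs[0].asnumpy()

    trt_out = bind_and_predict(True)
    ref_out = bind_and_predict(False)
    print("max abs diff:", np.abs(trt_out - ref_out).max())

This is also the pattern the unit tests can use to check TensorRT results against the reference MxNet graph.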
Limitations of initial integration and suggested future work

1. Since the new accelerator API proposal (https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries) was only published a few days ago and its implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent making use of that API in the near future.

2. Building the TensorRT engine takes a non-trivial amount of time, because the compiler evaluates performance and the hardware on the system before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, but larger models exist that may take longer. TensorRT comes with the ability to serialize the engine for a particular hardware platform; this is called serialization of a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem. Caching the TensorRT plan will be addressed in a future commit.

3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that provides an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

4. The current commit supports many, but not all, TensorRT operators. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.

5. TensorRT supports plugins, which could be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch-space tensors can be provided by MxNet and re-used by MxNet when the forward pass is not running. The memory in permanent use will then be limited to TensorRT storing the weights. Support for this feature will be added in a future commit.
