Hi everyone,

Sorry for the delayed reply to this thread.

First of all, the updated documentation is now on Confluence:
https://cwiki.apache.org/confluence/display/MXNET/Runtime+Integration+with+TensorRT

Da, the details of partitioning the graph between TensorRT-compatible nodes
and incompatible nodes, which remain as native MxNet operators, are provided
on the above Wiki page. The reason the entire graph is not converted to ONNX
first is that ONNX also supports a smaller set of operators than MxNet does.
So the partitioning is done on the NNVM graph: for a subgraph to be
extracted, it has to be both TensorRT-compatible and ONNX-compatible. Since
ONNX's operator support is strictly greater than TensorRT's, that's not a
problem. The only exception is when a user has TensorRT plugins. However,
that could eventually be handled using ONNX extensions, by storing metadata
about nodes that are only known to TensorRT via plugin registration,
similarly to PyTorch's ATen extension to ONNX.
Also, the general goal of TensorRT plugins is not ease of use, but
extensibility of deployment. Typically, a DL framework such as MxNet is used
for research and model training. TensorRT runtime integration makes it
easier to optimize inference performance without limiting the generality of
the model on which to run inference, because TensorRT isn't as general as
the frameworks in terms of operator support, and it also lacks the
framework's data pipeline. If the user needs to deploy a highly optimized
model that contains operators TensorRT does not support, they can port those
operators to TensorRT plugins, e.g. the way TensorRT plugins are used for
non-maximum suppression in, say, SSD. This lets the user deploy with
TensorRT alone, at the cost of rebuilding the data pipeline and adding
plugins for operators that TensorRT doesn't provide out of the box. Given
this summary, it's generally unlikely that a user would care about plugins
for in-framework inference, but they would if they need to deploy TensorRT
standalone.
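
For the in-framework path, the user-facing flow is just the normal MxNet
inference flow with the TensorRT pass switched on. Here is a minimal
sketch, based on the usage example in the design summary quoted below; the
model name, epoch, shapes, and input batch are placeholders:

    import os
    import mxnet as mx

    os.environ["MXNET_USE_TENSORRT"] = "1"  # checked at every bind/re-bind

    def merge_dicts(*dict_args):
        """Merge arg_params and aux_params to populate shared_buffer."""
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result

    # Placeholder checkpoint and shape values, for illustration only.
    model_name, num_epochs = "resnet-50", 0
    batch_size = 1
    data_shape = (batch_size, 3, 224, 224)

    device = mx.gpu(0)
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name,
                                                           num_epochs)
    # Weights are passed via shared_buffer so the TensorRT engine can be
    # built at bind time; grad_req='null' marks an inference-only graph.
    executor = sym.simple_bind(ctx=device,
                               data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params,
                                                         aux_params),
                               grad_req='null')

    my_data_batch = mx.nd.zeros(data_shape, ctx=device)  # placeholder input
    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    predictions = executor.outputs[0].asnumpy()
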

I hope this helps!

Marek


On Mon, Jun 11, 2018 at 7:17 PM Da Zheng <[email protected]> wrote:

> Hello Marek,
>
> Thank you for your detailed design doc. My understanding is that the
> current implementation is to convert an NNVM graph to an ONNX graph
> and load the ONNX graph to TensorRT.
> What is unclear to me is how an operator unsupported by TensorRT is
> handled in this strategy. It seems you fall back to the MXNet
> operators. Your current solution partitions a graph and loads
> subgraphs to TensorRT? If so, why do you need to convert a partitioned
> subgraph to ONNX first? If you convert the entire NNVM graph to ONNX,
> could you describe in more details how to fall back to MXNet
> operators?
>
> Thanks,
> Da
>
>
> On Mon, Jun 11, 2018 at 6:29 PM, Hagay Lupesko <[email protected]> wrote:
> > +1 for reviewing a design doc.
> >
> > Naveen - why do you see it sit under ONNX? Isn't it a broader topic of
> GPU
> > acceleration?
> >
> > Hagay
> >
> > On Mon, Jun 11, 2018, 12:56 Naveen Swamy <[email protected]> wrote:
> >
> >> please add your proposal under design proposals, once the community has
> >> reviewed and there is consensus on the approach we can create a
> ONNX-MXNet
> >> sub section and move there.
> >>
> >> On Mon, Jun 11, 2018 at 9:54 PM, Naveen Swamy <[email protected]>
> wrote:
> >>
> >> > you have access now.
> >> >
> >> > On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <[email protected]>
> >> wrote:
> >> >
> >> >> I'll add in about an hour
> >> >>
> >> >> > On Jun 11, 2018, at 8:12 PM, Marco de Abreu <
> >> >> [email protected]> wrote:
> >> >> >
> >> >> > I don't know how to grant permission on Confluence. If somebody
> else
> >> >> knows
> >> >> > how to do so, please grant Marek the edit permissions.
> >> >> >
> >> >> > -Marco
> >> >> >
> >> >> >> On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <
> [email protected]>
> >> >> wrote:
> >> >> >>
> >> >> >> Hi Rajan,
> >> >> >>
> >> >> >> I wanted to share on Confluence, but it didn't allow me to create
> a
> >> new
> >> >> >> document. If my e-mail address gets permissions to add new
> Confluence
> >> >> >> pages, I'll transfer the contents to Confluence. Please keep me
> >> posted
> >> >> when
> >> >> >> I get edit permissions.
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> Marek
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Jun 11, 2018 at 11:02 AM [email protected] <
> >> >> >> [email protected]> wrote:
> >> >> >>
> >> >> >>> HI Marek,
> >> >> >>>
> >> >> >>> Thanks for sharing the  document. It would be great if you could
> >> >> share it
> >> >> >>> on confluence wiki or a quip document. The formatting here makes
> it
> >> >> very
> >> >> >>> difficult to read a long document.
> >> >> >>>
> >> >> >>> Appreciate the help.
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>> Rajan
> >> >> >>>
> >> >> >>>> On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]>
> wrote:
> >> >> >>>> *Hi everyone,This is a quick summary of NVIDIA’s plans for
> >> >> >> open-sourcing
> >> >> >>> an
> >> >> >>>> initial integration of TensorRT as a runtime accelerator of
> MxNet
> >> (PR
> >> >> >> for
> >> >> >>>> discussion coming in the next few days, ETA of the first draft
> of
> >> the
> >> >> >> PR
> >> >> >>> is
> >> >> >>>> this Friday or even earlier). Feedback is appreciated.Best,Marek
> >> >> >>>> KolodziejNeed for runtime MxNet-TensorRT integration 1. TensorRT
> >> >> >> provides
> >> >> >>>> significant acceleration of model inference on NVIDIA GPUs
> compared
> >> >> to
> >> >> >>>> running the full graph in MxNet using unfused GPU operators. In
> >> >> >> addition
> >> >> >>> to
> >> >> >>>> faster fp32 inference, TensorRT optimizes fp16 inference, and is
> >> >> >> capable
> >> >> >>> of
> >> >> >>>> int8 inference (provided the quantization steps are performed).
> >> >> Besides
> >> >> >>>> increasing throughput, TensorRT significantly reduces inference
> >> >> >> latency,
> >> >> >>>> especially for small batches. See more here
> >> >> >>>> <https://developer.nvidia.com/tensorrt>.2. Despite its
> benefits,
> >> >> using
> >> >> >>>> pre-trained models with TensorRT typically requires some effort
> -
> >> >> >> either
> >> >> >>>> re-writing the model using TensorRT’s graph building APIs, or
> >> >> >> exporting a
> >> >> >>>> model to ONNX, followed by an import step. Even if the import is
> >> >> >>> simplified
> >> >> >>>> using ONNX, the TensorRT user still needs to provide their own
> data
> >> >> >>>> pipeline, which used to exist in the framework, but no longer
> does
> >> >> in a
> >> >> >>>> stand-alone TensorRT deployment with a client application.3.
> >> TensorRT
> >> >> >> is
> >> >> >>>> very performant, but does not have the full set of MxNet’s
> >> operators.
> >> >> >>> While
> >> >> >>>> that could be addressed with TensorRT plugins, it’s much
> simpler to
> >> >> >> reuse
> >> >> >>>> already-exisitng MxNet operators. Also, the user shouldn’t care
> >> about
> >> >> >>>> knowing which operators are supported by TensorRT and which ones
> >> >> >> aren’t -
> >> >> >>>> runtime integration allows the graph partitioner to extract
> >> subgraphs
> >> >> >>>> capable of running inside of TensorRT, place the subgraph in a
> >> >> TensorRT
> >> >> >>>> operator in MxNet, execute that operator as part of MxNet’s
> graph
> >> >> >>>> execusion, and handle non-TensorRT-compatible nodes as regular
> >> MxNet
> >> >> >>>> operators remaining after the TensorRT subgraph extraction and
> node
> >> >> >>>> substitution. The goal is to accelerate inference without
> changing
> >> >> user
> >> >> >>>> experience.Design considerations 1. Since TensorRT can only
> >> determine
> >> >> >> all
> >> >> >>>> possible optimizations once the tensor shapes are known, it is
> >> >> >> imperative
> >> >> >>>> that all the shape information be provided. This means that the
> >> best
> >> >> >> time
> >> >> >>>> to construct the TensorRT graph is bind time. The coming PR can
> >> >> >>> selectively
> >> >> >>>> apply the TensorRT optimization for inference-only graphs at
> symbol
> >> >> >> bind
> >> >> >>>> time. This is in fact consistent with the assumptions about
> >> TensorRT
> >> >> >> made
> >> >> >>>> on the MxNet Wiki here
> >> >> >>>> <
> >> >> >>>
> >> >> >> https://cwiki.apache.org/confluence/display/MXNET/Unified+
> >> >> integration+with+external+acceleration+libraries
> >> >> >>>> .
> >> >> >>>> 2. Since as mentioned in #1, TensorRT graph building needs shape
> >> >> >>>> information only available at bind time, an important goal was
> not
> >> to
> >> >> >>>> disrupt any existing APIs. Even though C++ permits default
> function
> >> >> >>>> arguments, the Python bindings for symbol-related methods (e.g.
> >> >> simple
> >> >> >>>> bind) are exposed via a C, not C++, API, wired on the Python
> side
> >> >> using
> >> >> >>>> Ctypes (e.g. see here
> >> >> >>>> <
> >> >> >>>
> >> >> >> https://github.com/apache/incubator-mxnet/blob/master/python
> >> >> /mxnet/symbol/symbol.py#L1486:L1521
> >> >> >>>>
> >> >> >>>> for the simple bind integration). This precludes the addition of
> >> >> extra
> >> >> >>>> arguments without causing breaking changes in the C API. Also,
> >> >> adapting
> >> >> >>> the
> >> >> >>>> Python code to such changes wouldn’t be enough, since all
> frontend
> >> >> >>>> languages use the C (not C++) API for the FFI. Fortunately, C
> API
> >> >> >> changes
> >> >> >>>> could be avoided, by simply letting the user enable or disable
> the
> >> >> >>> TensorRT
> >> >> >>>> pass using an environment variable (USE_TENSORRT=1 to enable).
> This
> >> >> >> also
> >> >> >>>> does not diminish the flexibility of the integration, since the
> >> graph
> >> >> >>> pass
> >> >> >>>> can read the environment variable each time symbol binding is
> done,
> >> >> and
> >> >> >>>> hence permits turning the graph passes on and off, depending on
> >> need.
> >> >> >> The
> >> >> >>>> ability to enable and disable the TensorRT pass at runtime also
> >> makes
> >> >> >>> unit
> >> >> >>>> testing easier.3. TensorRT requires that the workspace size is
> >> >> provided
> >> >> >>> at
> >> >> >>>> graph construction time. This value constitutes the upper limit
> on
> >> >> the
> >> >> >>>> amount of memory that TensorRT can use, and does not determine
> >> >> >> immediate
> >> >> >>>> use. Since this amount can be hard for the user to know, its
> limit
> >> >> >> should
> >> >> >>>> be set to a reasonable value that the user need not concern
> >> >> themselves
> >> >> >>>> with. Given that TensorRT integration is applied at bind time
> and
> >> >> that
> >> >> >>>> TensorRT engines wrapped in TensorRT nodes are constructed
> during
> >> the
> >> >> >>> graph
> >> >> >>>> pass rather than the memory allocation pass,  MxNet will only
> >> >> allocate
> >> >> >>> the
> >> >> >>>> amount needed for the nodes remaining after the TensorRT
> subgraphs
> >> >> have
> >> >> >>>> been extracted. This means that no memory will be doubly
> allocated
> >> -
> >> >> >>> first
> >> >> >>>> for the complete MxNet subgraph and then for TensorRT. However,
> the
> >> >> >>>> question remains whether the memory used per TensorRT engine
> should
> >> >> be
> >> >> >> a
> >> >> >>>> configurable parameter, either as a method argument or an
> >> environment
> >> >> >>>> variable, or whether TensorRT should be able to use the maximum
> >> >> >> available
> >> >> >>>> GPU memory and then reserve only what it needs. I would like to
> >> >> suggest
> >> >> >>> the
> >> >> >>>> latter. Since the TensorRT subgraph will typically use less
> memory
> >> >> than
> >> >> >>> the
> >> >> >>>> same subgraph in MxNet (due to more layer fusion), it’s
> extremely
> >> >> >>> unlikely
> >> >> >>>> that a model which runs purely as an MxNet graph would fail
> with an
> >> >> ouf
> >> >> >>> of
> >> >> >>>> memory error when parts or most of the graph run inside
> TensorRT.
> >> >> Fewer
> >> >> >>>> knobs (in this case, not giving the user the ability to tweak
> the
> >> >> >> maximum
> >> >> >>>> amount of memory availble to TensorRT would simplify use.4.
> >> TensorRT
> >> >> >> can
> >> >> >>>> accept graphs constructed using two main approaches: (a) via the
> >> >> >> TensorRT
> >> >> >>>> graph API, (b) using ONNX. Approach (a) seems simple on the
> >> surface -
> >> >> >> one
> >> >> >>>> traverses the NNVM graph, finds subgraphs that TensorRT can
> >> execute,
> >> >> >>>> converts the subgraphs to TensorRT graphs, and substitutes the
> >> >> >> subgraphs
> >> >> >>>> with TensorRT nodes, each of which contain the TensorRT engine
> >> >> >>>> corresponding to the subgraph. However, the approach taken by
> NVIDA
> >> >> was
> >> >> >>> to
> >> >> >>>> use ONNX as tha IR. The reason for this is twofold. First, ONNX
> is
> >> a
> >> >> >> very
> >> >> >>>> well-known IR, which is supported by the entire deep learning
> >> >> software
> >> >> >>>> community. This ensures that the design of the IR gets as much
> >> >> feedback
> >> >> >>> as
> >> >> >>>> possible as to whether the IR is feature complete, and what the
> >> >> >> semantics
> >> >> >>>> are. NVIDIA already maintains an ONNX-to-TensorRT converter
> (link
> >> >> >>>> <https://github.com/onnx/onnx-tensorrt>), and will continue to
> do
> >> >> so.
> >> >> >>>> Whatever changes that may apply to the TensorRT APIs or the
> >> internal
> >> >> >>>> features may be nicely hidden behind the well-established ONNX
> IR.
> >> >> >>> Second,
> >> >> >>>> ONNX is growing beyond being merely an IR. As it becomes more
> of a
> >> >> >>>> standard, its adoption will be associated with other benefits,
> such
> >> >> as
> >> >> >>> the
> >> >> >>>> ability to verify standard compliance.5. Despite the advantages
> of
> >> >> >> using
> >> >> >>>> the ONNX route described in #4, there are some costs. The main
> one
> >> is
> >> >> >>> the
> >> >> >>>> dependency on Protobuf. This is a valid criticism on the
> surface,
> >> >> >>> however,
> >> >> >>>> since the TensorRT integration requires an opt-in during build
> >> time,
> >> >> >>> adding
> >> >> >>>> one more dependency is not a problem if it is not a mandatory
> >> >> >> dependency.
> >> >> >>>> Moreover, the same Protobuf dependency already exists for the
> MxNet
> >> >> >> ONNX
> >> >> >>>> importer, which is now part of the MxNet source tree (link
> >> >> >>>> <
> >> >> >>>
> >> >> >> https://github.com/apache/incubator-mxnet/blob/76417594e56a8
> >> >> 5ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md
> >> >> >>>> ),
> >> >> >>>> rather than being located in a separate repository. Just like
> the
> >> use
> >> >> >> of
> >> >> >>>> the ONNX importer is optional and requires ONNX (and hence also
> >> >> >>> Protobuf),
> >> >> >>>> the TensorRT build is optional. 6. The optional integration of
> >> >> TensorRT
> >> >> >>>> will be guarded using a config.mk <http://config.mk> flag
> >> >> >>> (USE_TENSORRT),
> >> >> >>>> which will function similarly to other flags, such as USE_CUDA,
> >> >> >>> USE_CUDNN,
> >> >> >>>> etc. Needless to say, USE_TENSORRT will depend on CUDA and
> cuDNN.7.
> >> >> In
> >> >> >>>> order to simplify evaluation of the TensorRT build, usability
> and
> >> to
> >> >> >> run
> >> >> >>>> unit tests, the PR will come with a Dockerfile, which will allow
> >> >> anyone
> >> >> >>> to
> >> >> >>>> build MxNet with TensorRT, along with its dependencies, i.e.
> >> Protobuf
> >> >> >> and
> >> >> >>>> ONNX. APIs / user experienceThere is no change in the inference
> >> APIs,
> >> >> >>>> except for the need to set the MXNET_USE_TENSORRT environment
> >> >> variable
> >> >> >> to
> >> >> >>>> 1. For example, in Python, we can simply
> >> >> >>>> do:os.environ["MXNET_USE_TENSORRT"] = “1”Note that for backward
> >> >> >>>> compatibility, if the environment variable is not set, it will
> >> >> default
> >> >> >> to
> >> >> >>>> 0. Also, unlike some other environment variables that are only
> >> >> checked
> >> >> >>>> during MxNet initialization, this one gets checked every time
> graph
> >> >> >>> binding
> >> >> >>>> happens. This typically happens only once during the inference
> >> >> >>>> application’s life cycle, but since one can re-bind a symbol to
> say
> >> >> >>> compare
> >> >> >>>> a TensorRT and a non-TensorRT run, the check will happen during
> >> each
> >> >> >>>> bind/re-bind to enable that. Since the TensorRT graph pass is
> >> enabled
> >> >> >>> using
> >> >> >>>> an environment variable, no break in the C++, C or any frontend
> >> >> >> language
> >> >> >>>> API is needed. Note that there is one more change required - in
> >> >> calling
> >> >> >>>> simple bind. This doesn’t change the simple bind API, but how
> it’s
> >> >> >> called
> >> >> >>>> relative to the “usual” case, by using some of the arguments
> which
> >> >> are
> >> >> >>>> optional. This has to do with the shared_buffer parameter.
> Before
> >> >> >>>> explaining how the call changes, let’s consider why it’s
> necessary:
> >> >> 1.
> >> >> >>> The
> >> >> >>>> TensorRT graph needs to be constructed during the simple bind
> call,
> >> >> but
> >> >> >>>> before memory gets allocated for the non-TensorRT part of the
> >> graph.
> >> >> 2.
> >> >> >>>> TensorRT needs the weights, not just the shapes, to be provided
> >> >> before
> >> >> >>> the
> >> >> >>>> engine is constructed - it will store them inside the
> ICudaEngine
> >> >> >> object.
> >> >> >>>> The engine will then be serialized inside the NNVM TensorRT op,
> and
> >> >> >>>> deserialized when the graph executor takes over. This means that
> >> the
> >> >> >>>> weights need to be provided to the simple bind call to construct
> >> the
> >> >> >>>> TensorRT engine.3. The way to provide the weights is to hand
> them
> >> >> over
> >> >> >> to
> >> >> >>>> the simple bind call via the “shared buffer” argument. The
> shared
> >> >> >> buffer
> >> >> >>>> weights can be provided during the bind call and can be freed by
> >> the
> >> >> >>>> frontend language once binding is complete (e.g. by exiting the
> >> >> >> relevant
> >> >> >>>> scope in Python, or calling del).Since we need both arg_params
> >> >> >> (weights)
> >> >> >>>> and aux_params (e.g. BatchNorm moments), we need to merge
> >> arg_params
> >> >> >> and
> >> >> >>>> aux_params into one dictionary. Here’s a Python example:def
> >> >> >>>> merge_dicts(*dict_args):    """Merge arg_params and aux_params
> to
> >> >> >>> populate
> >> >> >>>> shared_buffer"""    result = {}    for dictionary in dict_args:
> >> >> >>>>       result.update(dictionary)    return resultNow let’s see a
> use
> >> >> >>>> example:device = mx.gpu(0)sym, arg_params, aux_params =
> >> >> >>>>   mx.model.load_checkpoint(model_name, num_epochs)executor =
> >> >> >>>> sym.simple_bind(ctx=device,    data=data_shape,
> >> >> >>>>   softmax_label=(batch_size,),
> >> >> >> shared_buffer=merge_dicts(arg_params,
> >> >> >>>> aux_params),,    grad_req='null',    force_rebind=True)Now we
> can
> >> >> >> simply
> >> >> >>>> update data in the executor’s arg dict and run the forward
> >> >> >>>> pass:executor.arg_dict["data"][:] =
> >> >> >>>> my_data_batchexecutor.forward(is_train=False)predictions =
> >> >> >>>> executor.outputs[0].asnumpy()Limitations of initial integration
> and
> >> >> >>>> suggested future work 1. Since the new accelerator API proposal
> >> (link
> >> >> >>>> <
> >> >> >>>
> >> >> >> https://cwiki.apache.org/confluence/display/MXNET/Unified+
> >> >> integration+with+external+acceleration+libraries
> >> >> >>>> )
> >> >> >>>> was only published a few days ago and the implementation is
> still
> >> on
> >> >> an
> >> >> >>>> MxNet fork, the current TensorRT integration doesn’t use that
> API
> >> >> yet,
> >> >> >>> but
> >> >> >>>> could be refactored in a future commit to use it. There is
> nothing
> >> in
> >> >> >> the
> >> >> >>>> current design that would prevent making use of that API in the
> >> near
> >> >> >>>> future.2. Building the TensorRT engine takes a non-trivial
> amount
> >> of
> >> >> >>> time,
> >> >> >>>> because the compiler evaluates performance and the hardware on
> the
> >> >> >> system
> >> >> >>>> before creating the fused layers on demand, and then needs to
> >> >> actually
> >> >> >>>> compile them. For ResNet-50 this may be a few seconds, but
> larger
> >> >> >> models
> >> >> >>>> also exist which may take longer. TensorRT comes with the
> ability
> >> to
> >> >> >>>> serialize the TensorRT engine for a particular hardware
> platform.
> >> >> This
> >> >> >> is
> >> >> >>>> called the serialization of a TensorRT plan, which is the engine
> >> >> along
> >> >> >>> with
> >> >> >>>> the ahead-of-time-compiled fused kernels for a given GPU. The
> first
> >> >> PR
> >> >> >> of
> >> >> >>>> the TensorRT integration will not provide for TensorRT plan
> >> caching,
> >> >> so
> >> >> >>>> using TensorRT might have a small start-up cost, but for
> >> long-running
> >> >> >>>> inference processes, this shouldn’t be a problem. Caching the
> >> >> TensorRT
> >> >> >>> plan
> >> >> >>>> will be addressed in a future commit.3. As mentioned before, the
> >> >> >>>> reproducibility of the build will be demonstrated using a Docker
> >> file
> >> >> >>> that
> >> >> >>>> will provide an easy way to evaluate the build. The Docker
> recipe
> >> was
> >> >> >>>> tested on Linux on x86_64, but not other platforms supported by
> >> >> >> TensorRT
> >> >> >>>> (Linux on 64-bit ARM  (aarch64), Android on aarch64, QNX on
> >> aarch64).
> >> >> >>>> Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T,
> i.e.
> >> >> Linux
> >> >> >>> for
> >> >> >>>> Tegra, on the NVIDIA Jetson platform) is left for subsequent
> >> commits.
> >> >> >> 4.
> >> >> >>>> The current commit supports many, but not all, of TensorRT
> >> operators.
> >> >> >> For
> >> >> >>>> example, this integration can run CNNs such as VGG, or ResNet,
> but
> >> >> not
> >> >> >>>> necessarily everything that TensorRT can support. More operators
> >> will
> >> >> >> be
> >> >> >>>> covered in future commits.5. TensorRT supports plugins, which
> can
> >> be
> >> >> >>>> integrated into the graph pass. However, this was not a priority
> >> >> since
> >> >> >>> the
> >> >> >>>> runtime TensorRT integration can always fall back to existing
> MxNet
> >> >> >>>> operators. Supporting plugins is possible, but will be added in
> >> >> future
> >> >> >>>> commits.6. The upcoming PR will support fp16 and fp32, but not
> >> int8.
> >> >> >>> Since
> >> >> >>>> int8 support in MxNet is itself very new, figuring out
> calibration
> >> >> and
> >> >> >>>> other details is left for a future commit.7. TensorRT 4 is
> going to
> >> >> >> have
> >> >> >>> a
> >> >> >>>> new feature called BYOM (bring your own memory). This means that
> >> >> >> instead
> >> >> >>> of
> >> >> >>>> telling TensorRT how much memory it can use, the data/scratch
> space
> >> >> >>> tensors
> >> >> >>>> can be provided by MxNet, and can be re-used by MxNet when not
> >> >> running
> >> >> >>> the
> >> >> >>>> forward pass. The memory in permanent use will then be limited
> to
> >> >> >>> TensorRT
> >> >> >>>> storing weights. Support for this feature will be added in a
> future
> >> >> >>> commit.*
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>
