I'll add it in about an hour.
On Jun 11, 2018, at 8:12 PM, Marco de Abreu <[email protected]> wrote:

I don't know how to grant permission on Confluence. If somebody else knows how to do so, please grant Marek the edit permissions.

-Marco

On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <[email protected]> wrote:

Hi Rajan,

I wanted to share on Confluence, but it didn't allow me to create a new document. If my e-mail address gets permissions to add new Confluence pages, I'll transfer the contents to Confluence. Please keep me posted when I get edit permissions.

Thanks!

Marek

On Mon, Jun 11, 2018 at 11:02 AM [email protected] <[email protected]> wrote:

Hi Marek,

Thanks for sharing the document. It would be great if you could share it on the Confluence wiki or in a Quip document. The formatting here makes a long document very difficult to read.

Appreciate the help.

Thanks
Rajan

On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]> wrote:

Hi everyone,

This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator for MxNet (a PR for discussion is coming in the next few days; the ETA for the first draft of the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here <https://developer.nvidia.com/tensorrt>.

2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort - either re-writing the model using TensorRT's graph-building APIs, or exporting the model to ONNX followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework but no longer does in a stand-alone TensorRT deployment with a client application.

3. TensorRT is very performant, but does not have the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't need to know which operators are supported by TensorRT and which ones aren't - runtime integration allows the graph partitioner to extract subgraphs capable of running inside TensorRT, place each subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle non-TensorRT-compatible nodes as regular MxNet operators remaining after the TensorRT subgraph extraction and node substitution (a toy illustration of this partitioning idea follows below). The goal is to accelerate inference without changing the user experience.
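To make point 3 concrete, here is a purely illustrative, linearized toy of the partitioning idea. It is not the actual NNVM graph pass (which operates on a DAG, not a flat list), and the set of supported operator names is hypothetical; it only shows how consecutive TensorRT-compatible operators get grouped into TensorRT subgraphs while everything else stays with the regular MxNet executor.

# Toy sketch only - the real pass walks the NNVM graph, not a flat list.
TRT_SUPPORTED = {"Convolution", "BatchNorm", "Activation", "Pooling", "FullyConnected"}

def partition(ops, supported=TRT_SUPPORTED):
    """Group maximal consecutive runs of supported ops into TensorRT segments."""
    segments = []  # list of ("tensorrt" | "mxnet", [op, ...]) pairs
    for op in ops:
        kind = "tensorrt" if op in supported else "mxnet"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(op)
        else:
            segments.append((kind, [op]))
    return segments

# The CNN body lands in a TensorRT segment; the unsupported custom op falls back to MxNet.
print(partition(["Convolution", "BatchNorm", "Activation", "MyCustomOp", "Pooling"]))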
Design considerations

1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all the shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet wiki here <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>.

2. Since, as mentioned in #1, TensorRT graph building needs shape information only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see here <https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521> for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Also, adapting the Python code to such changes wouldn't be enough, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes could be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to enable). This also does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph pass on and off depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.

3. TensorRT requires that the workspace size be provided at graph construction time. This value constitutes the upper limit on the amount of memory that TensorRT can use; it does not determine immediate use. Since this amount can be hard for the user to know, its limit should be set to a reasonable value that the user need not concern themselves with. Given that the TensorRT integration is applied at bind time, and that TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be able to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.

4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, or (b) using ONNX.
Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to the subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR. The reason for this is twofold. First, ONNX is a very well-known IR, supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether the IR is feature-complete and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (link <https://github.com/onnx/onnx-tensorrt>), and will continue to do so. Whatever changes may apply to the TensorRT APIs or its internal features can be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.

5. Despite the advantages of the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem if it is not a mandatory dependency. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (link <https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md>) rather than being located in a separate repository. Just as the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.

6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function similarly to other flags, such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluation of the TensorRT build and its usability, and to run unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.

APIs / user experience

There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python, we can simply do:

os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward compatibility, if the environment variable is not set, it defaults to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. This typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check happens during each bind/re-bind to enable that (a comparison sketch is given at the end of this section). Since the TensorRT graph pass is enabled using an environment variable, no break in the C++, C, or any frontend language API is needed.

Note that there is one more change required - in how simple bind is called. This doesn't change the simple bind API, but rather how it's called relative to the "usual" case, by using some of its optional arguments. This has to do with the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:

1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it will store them inside the ICudaEngine object. The engine will then be serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call to construct the TensorRT engine.

3. The way to provide the weights is to hand them over to the simple bind call via the shared_buffer argument. The shared buffer weights can be provided during the bind call and freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or by calling del).

Since we need both arg_params (weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:

def merge_dicts(*dict_args):
    """Merge arg_params and aux_params to populate shared_buffer"""
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

Now let's see a usage example:

device = mx.gpu(0)
sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
executor = sym.simple_bind(ctx=device, data=data_shape,
                           softmax_label=(batch_size,),
                           shared_buffer=merge_dicts(arg_params, aux_params),
                           grad_req='null', force_rebind=True)

Now we can simply update the data in the executor's arg dict and run the forward pass:

executor.arg_dict["data"][:] = my_data_batch
executor.forward(is_train=False)
predictions = executor.outputs[0].asnumpy()
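Since the environment variable is re-read on every bind, one can compare a TensorRT and a non-TensorRT run of the same model. The following is a minimal sketch of such a comparison, reusing the binding pattern and placeholder names (model_name, num_epochs, data_shape, batch_size, my_data_batch) from the example above; it is not part of the PR itself.

import os
import numpy as np
import mxnet as mx

def predict(use_tensorrt):
    # The flag is read at bind time, so toggling it and re-binding is enough.
    os.environ["MXNET_USE_TENSORRT"] = "1" if use_tensorrt else "0"
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
    executor = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)
    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    return executor.outputs[0].asnumpy()

# Loose tolerances, since TensorRT may fuse layers and run parts of the graph in fp16.
np.testing.assert_allclose(predict(True), predict(False), rtol=1e-2, atol=1e-3)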
Limitations of the initial integration and suggested future work

1. Since the new accelerator API proposal (link <https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>) was only published a few days ago and its implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent making use of that API in the near future.

2. Building the TensorRT engine takes a non-trivial amount of time, because the TensorRT builder profiles the hardware and evaluates kernel performance before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, and larger models exist which may take longer. TensorRT comes with the ability to serialize the TensorRT engine for a particular hardware platform. This is called serialization of a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide TensorRT plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem (a rough way to measure the start-up cost is sketched after this list). Caching the TensorRT plan will be addressed in a future commit.
3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that provides an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (such as L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

4. The current commit supports many, but not all, of the operators that TensorRT can handle. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.

5. TensorRT supports plugins, which could be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch-space tensors can be provided by MxNet and re-used by MxNet when not running the forward pass. The memory in permanent use will then be limited to TensorRT storing the weights. Support for this feature will be added in a future commit.
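Regarding the start-up cost mentioned in point 2, here is a minimal sketch (again using the placeholder names from the earlier example, and assuming the MXNET_USE_TENSORRT variable described above) of how one might separate the one-time engine-build cost paid at bind time from steady-state inference latency:

import os
import time
import mxnet as mx

os.environ["MXNET_USE_TENSORRT"] = "1"
sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)

start = time.time()
executor = sym.simple_bind(ctx=mx.gpu(0), data=data_shape,
                           softmax_label=(batch_size,),
                           shared_buffer=merge_dicts(arg_params, aux_params),
                           grad_req='null', force_rebind=True)
print("bind time (includes TensorRT engine build): %.1f s" % (time.time() - start))

executor.arg_dict["data"][:] = my_data_batch
executor.forward(is_train=False)        # warm-up run
executor.outputs[0].wait_to_read()

start = time.time()
num_batches = 100
for _ in range(num_batches):
    executor.forward(is_train=False)
    executor.outputs[0].wait_to_read()  # synchronize so the timing is meaningful
print("steady-state latency: %.2f ms/batch" % ((time.time() - start) * 1000 / num_batches))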
