+1 for reviewing a design doc. Naveen - why do you see it sitting under ONNX? Isn't it the broader topic of GPU acceleration?
Hagay

On Mon, Jun 11, 2018, 12:56 Naveen Swamy <[email protected]> wrote:

please add your proposal under design proposals; once the community has reviewed it and there is consensus on the approach, we can create an ONNX-MXNet sub-section and move it there.

On Mon, Jun 11, 2018 at 9:54 PM, Naveen Swamy <[email protected]> wrote:

you have access now.

On Mon, Jun 11, 2018 at 8:34 PM, Naveen Swamy <[email protected]> wrote:

I'll add in about an hour.

On Jun 11, 2018, at 8:12 PM, Marco de Abreu <[email protected]> wrote:

I don't know how to grant permission on Confluence. If somebody else knows how to do so, please grant Marek the edit permissions.

-Marco

On Mon, Jun 11, 2018 at 11:05 AM Marek Kolodziej <[email protected]> wrote:

Hi Rajan,

I wanted to share on Confluence, but it didn't allow me to create a new document. If my e-mail address gets permissions to add new Confluence pages, I'll transfer the contents to Confluence. Please keep me posted when I get edit permissions.

Thanks!

Marek

On Mon, Jun 11, 2018 at 11:02 AM [email protected] <[email protected]> wrote:

Hi Marek,

Thanks for sharing the document. It would be great if you could share it on the Confluence wiki or a Quip document. The formatting here makes a long document very difficult to read.

Appreciate the help.

Thanks,
Rajan

On 2018/06/11 17:50:26, Marek Kolodziej <[email protected]> wrote:

Hi everyone,

This is a quick summary of NVIDIA's plans for open-sourcing an initial integration of TensorRT as a runtime accelerator for MxNet (a PR for discussion is coming in the next few days; the ETA for the first draft of the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MxNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference, and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches. See more here: https://developer.nvidia.com/tensorrt

2. Despite its benefits, using pre-trained models with TensorRT typically requires some effort - either re-writing the model using TensorRT's graph-building APIs, or exporting the model to ONNX, followed by an import step. Even if the import is simplified using ONNX, the TensorRT user still needs to provide their own data pipeline, which used to exist in the framework but no longer does in a stand-alone TensorRT deployment with a client application.

3. TensorRT is very performant, but does not have the full set of MxNet's operators. While that could be addressed with TensorRT plugins, it's much simpler to reuse already-existing MxNet operators. Also, the user shouldn't need to know which operators are supported by TensorRT and which ones aren't - runtime integration allows the graph partitioner to extract subgraphs capable of running inside TensorRT, place each subgraph in a TensorRT operator in MxNet, execute that operator as part of MxNet's graph execution, and handle non-TensorRT-compatible nodes as regular MxNet operators remaining after the TensorRT subgraph extraction and node substitution (a toy sketch of this partitioning idea follows this list). The goal is to accelerate inference without changing the user experience.
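To make the partitioning idea in point 3 concrete, here is a minimal, purely illustrative Python sketch. It is not the actual NNVM graph pass: it treats the network as a flat chain of operator names, and the operator names and supported-op set are made up for illustration. Maximal runs of compatible ops stand in for the subgraphs that would be replaced by a single TensorRT node.

    # Illustrative only: a toy partitioner over a linear chain of ops.
    # The op names and the supported-op set below are assumptions, not the
    # actual whitelist used by the integration.
    TRT_SUPPORTED = {"Convolution", "BatchNorm", "Activation",
                     "Pooling", "FullyConnected"}

    def partition(ops):
        """Group maximal runs of TensorRT-compatible ops into segments.

        Each segment is ("tensorrt", [ops]) for a run that would become a
        single TensorRT node, or ("mxnet", [op]) for an operator that stays
        on the regular MxNet execution path.
        """
        segments, run = [], []
        for op in ops:
            if op in TRT_SUPPORTED:
                run.append(op)
            else:
                if run:
                    segments.append(("tensorrt", run))
                    run = []
                segments.append(("mxnet", [op]))
        if run:
            segments.append(("tensorrt", run))
        return segments

    print(partition(["Convolution", "BatchNorm", "Activation",
                     "CustomOp", "FullyConnected", "SoftmaxOutput"]))
    # -> [('tensorrt', ['Convolution', 'BatchNorm', 'Activation']),
    #     ('mxnet', ['CustomOp']),
    #     ('tensorrt', ['FullyConnected']),
    #     ('mxnet', ['SoftmaxOutput'])]

In the real integration the partitioning operates on the NNVM graph rather than a flat list, but the user-facing effect is the same: compatible regions run inside TensorRT, and everything else falls back to regular MxNet operators.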
Design considerations

1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all the shape information be provided. This means that the best time to construct the TensorRT graph is bind time. The coming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet wiki here: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries

2. Since, as mentioned in #1, TensorRT graph building needs shape information only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521 for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Also, adapting the Python code to such changes wouldn't be enough, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes can be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable (MXNET_USE_TENSORRT=1 to enable). This does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time symbol binding is done, and hence permits turning the graph pass on and off depending on need. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier.

3. TensorRT requires that the workspace size be provided at graph construction time. This value constitutes an upper limit on the amount of memory that TensorRT can use, and does not determine immediate use. Since this amount can be hard for the user to know, the limit should be set to a reasonable value that the user need not concern themselves with. Given that TensorRT integration is applied at bind time, and that the TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the amount needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory will be doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be able to use the maximum available GPU memory and then reserve only what it needs. I would like to suggest the latter. Since the TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it's extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.
4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, or (b) using ONNX. Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts the subgraphs to TensorRT graphs, and substitutes the subgraphs with TensorRT nodes, each of which contains the TensorRT engine corresponding to its subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR. The reason for this is twofold. First, ONNX is a very well-known IR, supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether the IR is feature-complete and what the semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (https://github.com/onnx/onnx-tensorrt), and will continue to do so (a stand-alone example of this ONNX-to-TensorRT path is sketched after this list). Whatever changes may apply to the TensorRT APIs or internal features can be nicely hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will be associated with other benefits, such as the ability to verify standard compliance.

5. Despite the advantages of using the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires an opt-in at build time, adding one more dependency is not a problem if it is not a mandatory dependency. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (https://github.com/apache/incubator-mxnet/blob/76417594e56a85ec0cc9412b9dd2c7e2ab581d8b/docs/api/python/contrib/onnx.md) rather than being located in a separate repository. Just like the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.
6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function similarly to other flags such as USE_CUDA, USE_CUDNN, etc. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

7. In order to simplify evaluating the TensorRT build, checking usability, and running unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.
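As a point of reference for the ONNX route mentioned in design consideration #4, the onnx-tensorrt converter can also be driven directly from Python. The sketch below follows the usage pattern documented in the onnx-tensorrt repository; the model path, device string, and input shape are placeholders, and this stand-alone flow is separate from the in-framework integration proposed here (where the NNVM-to-ONNX-to-TensorRT conversion happens at bind time).

    # Stand-alone ONNX -> TensorRT flow via NVIDIA's onnx-tensorrt converter.
    # Placeholder model path and input shape; shown only to illustrate the
    # ONNX route that the proposed integration uses internally.
    import numpy as np
    import onnx
    import onnx_tensorrt.backend as backend

    model = onnx.load("/path/to/model.onnx")        # placeholder path
    engine = backend.prepare(model, device="CUDA:0")
    input_data = np.random.random(size=(1, 3, 224, 224)).astype(np.float32)
    output = engine.run(input_data)[0]
    print(output.shape)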
APIs / user experience

There is no change in the inference APIs, except for the need to set the MXNET_USE_TENSORRT environment variable to 1. For example, in Python we can simply do:

    os.environ["MXNET_USE_TENSORRT"] = "1"

Note that for backward compatibility, if the environment variable is not set, it defaults to 0. Also, unlike some other environment variables that are only checked during MxNet initialization, this one gets checked every time graph binding happens. Binding typically happens only once during the inference application's life cycle, but since one can re-bind a symbol to, say, compare a TensorRT and a non-TensorRT run, the check happens during each bind/re-bind to enable that. Since the TensorRT graph pass is enabled using an environment variable, no break in the C++, C, or any frontend language API is needed.

Note that there is one more change required - in calling simple bind. This doesn't change the simple bind API, but it does change how it's called relative to the "usual" case, by using some of the optional arguments. This has to do with the shared_buffer parameter. Before explaining how the call changes, let's consider why it's necessary:

1. The TensorRT graph needs to be constructed during the simple bind call, but before memory gets allocated for the non-TensorRT part of the graph.

2. TensorRT needs the weights, not just the shapes, to be provided before the engine is constructed - it will store them inside the ICudaEngine object. The engine will then be serialized inside the NNVM TensorRT op, and deserialized when the graph executor takes over. This means that the weights need to be provided to the simple bind call in order to construct the TensorRT engine.

3. The way to provide the weights is to hand them over to the simple bind call via the shared_buffer argument. The shared buffer weights can be provided during the bind call and can be freed by the frontend language once binding is complete (e.g. by exiting the relevant scope in Python, or calling del).

Since we need both arg_params (weights) and aux_params (e.g. BatchNorm moments), we need to merge arg_params and aux_params into one dictionary. Here's a Python example:

    def merge_dicts(*dict_args):
        """Merge arg_params and aux_params to populate shared_buffer"""
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result

Now let's see a usage example:

    device = mx.gpu(0)
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_name, num_epochs)
    executor = sym.simple_bind(ctx=device, data=data_shape,
                               softmax_label=(batch_size,),
                               shared_buffer=merge_dicts(arg_params, aux_params),
                               grad_req='null', force_rebind=True)

Now we can simply update the data in the executor's arg dict and run the forward pass:

    executor.arg_dict["data"][:] = my_data_batch
    executor.forward(is_train=False)
    predictions = executor.outputs[0].asnumpy()
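Building on the example above, a symbol can be re-bound with the environment variable toggled in order to compare the TensorRT and non-TensorRT paths on the same inputs. This is a hypothetical sketch assuming the proposed MXNET_USE_TENSORRT semantics (read at every bind/re-bind); it reuses merge_dicts and the checkpoint variables from the example above.

    import os
    import numpy as np
    import mxnet as mx

    def bind_and_predict(sym, arg_params, aux_params, data_batch, use_trt):
        # Toggle the proposed graph pass; the flag is read at bind time.
        os.environ["MXNET_USE_TENSORRT"] = "1" if use_trt else "0"
        executor = sym.simple_bind(ctx=mx.gpu(0), data=data_batch.shape,
                                   softmax_label=(data_batch.shape[0],),
                                   shared_buffer=merge_dicts(arg_params, aux_params),
                                   grad_req='null', force_rebind=True)
        executor.arg_dict["data"][:] = data_batch
        executor.forward(is_train=False)
        return executor.outputs[0].asnumpy()

    # trt_out = bind_and_predict(sym, arg_params, aux_params, my_data_batch, True)
    # ref_out = bind_and_predict(sym, arg_params, aux_params, my_data_batch, False)
    # np.testing.assert_allclose(trt_out, ref_out, rtol=1e-2, atol=1e-2)

The loose tolerances in the last line reflect that fused TensorRT kernels are not expected to be bit-exact with the unfused MxNet operators.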
Limitations of initial integration and suggested future work

1. Since the new accelerator API proposal (https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries) was only published a few days ago and the implementation is still on an MxNet fork, the current TensorRT integration doesn't use that API yet, but could be refactored in a future commit to use it. There is nothing in the current design that would prevent making use of that API in the near future.

2. Building the TensorRT engine takes a non-trivial amount of time, because the compiler evaluates performance and the hardware on the system before creating the fused layers on demand, and then needs to actually compile them. For ResNet-50 this may be a few seconds, but larger models exist that may take longer. TensorRT comes with the ability to serialize the TensorRT engine for a particular hardware platform. This is called serialization of a TensorRT plan, which is the engine along with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide TensorRT plan caching, so using TensorRT might have a small start-up cost, but for long-running inference processes this shouldn't be a problem. Caching the TensorRT plan will be addressed in a future commit.

3. As mentioned before, the reproducibility of the build will be demonstrated using a Dockerfile that will provide an easy way to evaluate the build. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (e.g. L4T, i.e. Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

4. The current commit supports many, but not all, of TensorRT's operators. For example, this integration can run CNNs such as VGG or ResNet, but not necessarily everything that TensorRT can support. More operators will be covered in future commits.

5. TensorRT supports plugins, which could be integrated into the graph pass. However, this was not a priority, since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch-space tensors can be provided by MxNet and re-used by MxNet when not running the forward pass. The memory in permanent use will then be limited to TensorRT storing the weights. Support for this feature will be added in a future commit.
