lhutton1 commented on a change in pull request #6395:
URL: https://github.com/apache/incubator-tvm/pull/6395#discussion_r490817376



##########
File path: docs/deploy/tensorrt.rst
##########
@@ -0,0 +1,267 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Relay TensorRT Integration
+==============================================
+**Author**: `Trevor Morris <https://github.com/trevor-m>`_
+
+Introduction
+------------
+
+NVIDIA TensorRT is a library for optimized deep learning inference. This integration will offload as
+many operators as possible from Relay to TensorRT, providing a performance boost on NVIDIA GPUs
+without the need to tune schedules.
+
+Installing TensorRT
+------------------------------
+
+In order to download TensorRT, you will need to create an NVIDIA Developer program account. Please
+see NVIDIA's documentation for more info:
+https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html. If you have a Jetson device
+such as a TX1, TX2, Xavier, or Nano, TensorRT will already be installed on the device via the
+JetPack SDK.
+
+There are two methods to install TensorRT:
+
+* System install via deb or rpm package.
+* Tar file installation.
+
+With the tar file installation method, you must provide the path of the extracted tar archive to
+USE_TENSORRT_GRAPH_RUNTIME=/path/to/TensorRT. With the system install method,
+USE_TENSORRT_GRAPH_RUNTIME=ON will automatically locate your installation.
+
+Building TVM with TensorRT support
+----------------------------------
+
+There are two separate build flags for TensorRT integration in TVM:
+
+* USE_TENSORRT=ON/OFF - This flag will enable compiling a TensorRT module, which does not require
+  any TensorRT library.
+* USE_TENSORRT_GRAPH_RUNTIME=ON/OFF/path-to-TensorRT - This flag will enable the TensorRT runtime
+  module. This will build TVM against the TensorRT libraries.
+
+Example setting in config.cmake file:
+
+.. code:: cmake
+
+    set(USE_TENSORRT ON)
+    set(USE_TENSORRT_GRAPH_RUNTIME /home/ubuntu/TensorRT-7.0.0.11)
+
+
+Usage
+-----
+
+.. note::
+
+    This section may not stay up-to-date with changes to the API.
+
+Create a Relay graph from an MXNet ResNet18 model.
+
+.. code:: python
+
+    import tvm
+    from tvm import relay
+    import mxnet
+    from mxnet.gluon.model_zoo.vision import get_model
+
+    dtype = "float32"
+    input_shape = (1, 3, 224, 224)
+    block = get_model('resnet18_v1', pretrained=True)
+    mod, params = relay.frontend.from_mxnet(block, shape={'data': input_shape}, dtype=dtype)
+
+
+Annotate and partition the graph for TensorRT. All ops which are supported by the TensorRT
+integration will be marked and offloaded to TensorRT. The rest of the ops will go through the
+regular TVM CUDA compilation and code generation.
+
+.. code:: python
+
+    from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
+    mod, config = partition_for_tensorrt(mod, params)
+
+
+Build the Relay graph, using the new module and config returned by partition_for_tensorrt. The
+target must always be "cuda".
+
+.. code:: python
+
+    target = "cuda"
+    with tvm.transform.PassContext(opt_level=3, config={'relay.ext.tensorrt.options': config}):
+        lib = relay.build(mod, target=target, params=params)
+
+
+Export the module.
+
+.. code:: python
+
+    lib.export_library('compiled.so')
+
+
+Run inference.
+
+.. code:: python
+
+    import numpy as np
+    from tvm.contrib import graph_runtime
+
+    ctx = tvm.gpu(0)
+    loaded_lib = tvm.runtime.load_module('compiled.so')
+    gen_module = graph_runtime.GraphModule(loaded_lib['default'](ctx))
+    input_data = np.random.uniform(0, 1, input_shape).astype(dtype)
+    gen_module.run(data=input_data)
+
+
+Operator support
+----------------
+
++------------------------+------------------------------------+
+|       Relay Node       |              Remarks               |
++========================+====================================+
+| nn.relu                |                                    |
++------------------------+------------------------------------+
+| sigmoid                |                                    |
++------------------------+------------------------------------+
+| tanh                   |                                    |
++------------------------+------------------------------------+
+| nn.batch_norm          |                                    |
++------------------------+------------------------------------+
+| nn.softmax             |                                    |
++------------------------+------------------------------------+
+| nn.conv2d              |                                    |
++------------------------+------------------------------------+
+| nn.dense               |                                    |
++------------------------+------------------------------------+
+| nn.bias_add            |                                    |
++------------------------+------------------------------------+
+| add                    |                                    |
++------------------------+------------------------------------+
+| subtract               |                                    |
++------------------------+------------------------------------+
+| multiply               |                                    |
++------------------------+------------------------------------+
+| divide                 |                                    |
++------------------------+------------------------------------+
+| power                  |                                    |
++------------------------+------------------------------------+
+| maximum                |                                    |
++------------------------+------------------------------------+
+| minimum                |                                    |
++------------------------+------------------------------------+
+| nn.max_pool2d          |                                    |
++------------------------+------------------------------------+
+| nn.avg_pool2d          |                                    |
++------------------------+------------------------------------+
+| nn.global_max_pool2d   |                                    |
++------------------------+------------------------------------+
+| nn.global_avg_pool2d   |                                    |
++------------------------+------------------------------------+
+| exp                    |                                    |
++------------------------+------------------------------------+
+| log                    |                                    |
++------------------------+------------------------------------+
+| sqrt                   |                                    |
++------------------------+------------------------------------+
+| abs                    |                                    |
++------------------------+------------------------------------+
+| negative               |                                    |
++------------------------+------------------------------------+
+| nn.batch_flatten       |                                    |
++------------------------+------------------------------------+
+| expand_dims            |                                    |
++------------------------+------------------------------------+
+| squeeze                |                                    |
++------------------------+------------------------------------+
+| concatenate            |                                    |
++------------------------+------------------------------------+
+| nn.conv2d_transpose    |                                    |
++------------------------+------------------------------------+
+| transpose              |                                    |
++------------------------+------------------------------------+
+| layout_transform       |                                    |
++------------------------+------------------------------------+
+| reshape                |                                    |
++------------------------+------------------------------------+
+| nn.pad                 |                                    |
++------------------------+------------------------------------+
+| sum                    |                                    |
++------------------------+------------------------------------+
+| prod                   |                                    |
++------------------------+------------------------------------+
+| max                    |                                    |
++------------------------+------------------------------------+
+| min                    |                                    |
++------------------------+------------------------------------+
+| mean                   |                                    |
++------------------------+------------------------------------+
+| nn.adaptive_max_pool2d |                                    |
++------------------------+------------------------------------+
+| nn.adaptive_avg_pool2d |                                    |
++------------------------+------------------------------------+
+| clip                   | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| nn.leaky_relu          | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| sin                    | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| cos                    | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| atan                   | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| ceil                   | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| floor                  | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| strided_slice          | Requires TensorRT 5.1.5 or greater |
++------------------------+------------------------------------+
+| nn.conv3d              | Requires TensorRT 6.0.1 or greater |
++------------------------+------------------------------------+
+| nn.max_pool3d          | Requires TensorRT 6.0.1 or greater |
++------------------------+------------------------------------+
+| nn.avg_pool3d          | Requires TensorRT 6.0.1 or greater |
++------------------------+------------------------------------+
+| nn.conv3d_transpose    | Requires TensorRT 6.0.1 or greater |
++------------------------+------------------------------------+
+
+
+Runtime Settings
+----------------
+
+There are some additional options which can be configured at runtime using environment variables.
+
+* Automatic FP16 Conversion - Environment variable ``TVM_TENSORRT_USE_FP16=1`` can be set to
+  automatically convert the TensorRT components of your model to 16-bit floating point precision.
+  This can greatly increase performance, but may cause some slight loss in model accuracy.
+* During the first inference, the runtime will invoke the TensorRT API to build an engine. This can
+  be time consuming, so you can set ``TVM_TENSORRT_CACHE_DIR`` to point to a directory where the
+  built engines will be saved on disk. The next time you load the model and give it the same
+  directory, the runtime will load the already built engines to avoid the long warmup time. A
+  unique directory is required for each model.
+* TensorRT has a parameter to configure the maximum amount of scratch space that each layer in the
+  model can use. It is generally best to use the highest value which does not cause you to run out
+  of memory. You can use ``TVM_TENSORRT_MAX_WORKSPACE_SIZE`` to override this by specifying the
+  workspace size in bytes you would like to use.
+
+Adding a new operator
+---------------------
+
+To add support for a new operator, there are several files we need to make changes to:
+
+* `python/relay/op/contrib/tensorrt.py` This file contains the annotation rules for TensorRT. These
+  determine which operators and which of their attributes are supported.
+* `src/runtime/contrib/arm_compute_lib/tensorrt_ops.cc` Create a new op converter class which

Review comment:
       ```suggestion
   * `src/runtime/contrib/tensorrt/tensorrt_ops.cc` Create a new op converter class which
   ```
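
   As an aside, the op converter pattern that `tensorrt_ops.cc` follows can be sketched in isolation. All names below (`OpConverter`, `SoftmaxOpConverter`, `ConvertOp`, the registry) are illustrative stand-ins, not the actual TVM classes; the real converters receive TensorRT network/builder handles rather than a plain struct:

   ```cpp
   #include <cassert>
   #include <map>
   #include <memory>
   #include <string>

   // Stand-in for the parameters a converter receives (the real struct would
   // hold TensorRT network handles and the JSON node's attributes).
   struct OpConverterParams {
     std::string op_name;
     bool converted = false;
   };

   // Base class: one subclass per supported Relay op.
   class OpConverter {
    public:
     virtual ~OpConverter() = default;
     virtual void Convert(OpConverterParams* params) const = 0;
   };

   // Example converter for one op. A real converter would emit TensorRT
   // layers, e.g. via network->addSoftMax(...).
   class SoftmaxOpConverter : public OpConverter {
    public:
     void Convert(OpConverterParams* params) const override {
       params->converted = true;
     }
   };

   // Registry mapping Relay op names to their converters.
   std::map<std::string, std::unique_ptr<OpConverter>>& GetOpConverters() {
     static std::map<std::string, std::unique_ptr<OpConverter>> registry;
     return registry;
   }

   // Look up and apply the converter for an op; false means unsupported.
   bool ConvertOp(const std::string& op_name, OpConverterParams* params) {
     auto& registry = GetOpConverters();
     auto it = registry.find(op_name);
     if (it == registry.end()) return false;
     it->second->Convert(params);
     return true;
   }
   ```

   Adding a new op then amounts to writing one converter subclass and registering it under the Relay op name.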

##########
File path: src/runtime/contrib/tensorrt/tensorrt_runtime.cc
##########
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file src/runtime/contrib/tensorrt/tensorrt_runtime.cc
+ * \brief JSON runtime implementation for TensorRT.
+ */
+
+#include <dmlc/parameter.h>
+#include <tvm/runtime/ndarray.h>
+#include <tvm/runtime/registry.h>
+
+#include <fstream>
+
+#include "../../file_util.h"
+#include "../json/json_node.h"
+#include "../json/json_runtime.h"
+
+#ifdef TVM_GRAPH_RUNTIME_TENSORRT
+#include "NvInfer.h"
+#include "tensorrt_builder.h"
+#endif
+
+namespace tvm {
+namespace runtime {
+namespace contrib {
+
+using namespace tvm::runtime::json;
+
+class TensorRTRuntime : public JSONRuntimeBase {
+ public:
+  /*!
+   * \brief The TensorRT runtime module. Deserialize the provided functions
+   * on creation and store in the layer cache.
+   *
+   * \param symbol_name The name of the function.
+   * \param graph_json serialized JSON representation of a sub-graph.
+   * \param const_names The names of each constant in the sub-graph.
+   */
+  explicit TensorRTRuntime(const std::string& symbol_name, const std::string& graph_json,
+                           const Array<String>& const_names)
+      : JSONRuntimeBase(symbol_name, graph_json, const_names), use_implicit_batch_(true),
+        max_workspace_size_(size_t(1) << 30) {}
+
+  /*!
+   * \brief The type key of the module.
+   *
+   * \return module type key.
+   */
+  const char* type_key() const override { return "tensorrt"; }
+
+  /*!
+   * \brief Initialize runtime. Create TensorRT layer from JSON
+   * representation.
+   *
+   * \param consts The constant params from compiled model.
+   */
+  void Init(const Array<NDArray>& consts) override {
+    CHECK_EQ(consts.size(), const_idx_.size())
+        << "The number of input constants must match the number required.";
+    LoadGlobalAttributes();
+    if (GetCachedEnginesFromDisk()) return;
+    SetupConstants(consts);
+    BuildEngine();
+    CacheEngineToDisk();
+  }
+
+  void LoadGlobalAttributes() {
+    // These settings are global to the entire subgraph. Codegen will add them as attributes to all
+    // op nodes. Read from first one.
+    for (size_t i = 0; i < nodes_.size(); ++i) {
+      if (nodes_[i].HasAttr("use_implicit_batch") && nodes_[i].HasAttr("max_workspace_size")) {
+        use_implicit_batch_ =
+            std::stoi(nodes_[i].GetAttr<std::vector<std::string>>("use_implicit_batch")[0]);
+        // Allow max_workspace_size to be overridden at runtime.
+        size_t runtime_max_workspace_size =
+            dmlc::GetEnv("TVM_TENSORRT_MAX_WORKSPACE_SIZE", size_t(0));
+        if (runtime_max_workspace_size != 0) {
+          max_workspace_size_ = runtime_max_workspace_size;
+        } else {
+          max_workspace_size_ =
+              std::stoul(nodes_[i].GetAttr<std::vector<std::string>>("max_workspace_size")[0]);
+        }
+        return;
+      }
+    }
+  }
+
+#ifdef TVM_GRAPH_RUNTIME_TENSORRT
+  /*! \brief Run inference using built engine. */
+  void Run() override {
+    auto& engine_and_context = trt_engine_cache_.at(symbol_name_);
+    auto engine = engine_and_context.engine;
+    auto context = engine_and_context.context;
+    std::vector<void*> bindings(engine->getNbBindings(), nullptr);
+
+    for (size_t i = 0; i < input_nodes_.size(); ++i) {
+      auto nid = input_nodes_[i];
+      if (nodes_[nid].GetOpType() == "input") {
+        for (size_t j = 0; j < nodes_[nid].GetOpShape().size(); ++j) {
+          uint32_t eid = EntryID(nid, j);
+          const std::string name = nodes_[nid].GetOpName() + "_" + std::to_string(j);
+          int binding_index = engine->getBindingIndex(name.c_str());
+          CHECK_NE(binding_index, -1);
+          bindings[binding_index] = data_entry_[eid]->data;
+        }
+      }
+    }
+
+    for (size_t i = 0; i < outputs_.size(); ++i) {
+      uint32_t eid = EntryID(outputs_[i]);
+      const std::string& name = engine_and_context.outputs[i];
+      int binding_index = engine->getBindingIndex(name.c_str());
+      CHECK_NE(binding_index, -1);
+      bindings[binding_index] = data_entry_[eid]->data;
+    }
+
+#if TRT_VERSION_GE(6, 0, 1)
+    if (use_implicit_batch_) {
+      CHECK(context->execute(batch_size_, bindings.data())) << "Running TensorRT failed.";
+    } else {
+      CHECK(context->executeV2(bindings.data())) << "Running TensorRT failed.";
+    }
+#else
+    CHECK(context->execute(batch_size_, bindings.data())) << "Running TensorRT failed.";
+#endif
+  }
+
+ private:
+  /*!
+   * \brief Build TensorRT engine from JSON representation.
+   */
+  void BuildEngine() {
+    LOG(INFO) << "Building new TensorRT engine for subgraph " << symbol_name_;
+    const bool use_fp16 = dmlc::GetEnv("TVM_TENSORRT_USE_FP16", false);
+    batch_size_ = GetBatchSize();
+    TensorRTBuilder builder(&logger_, max_workspace_size_, use_implicit_batch_, use_fp16,
+                            batch_size_);
+
+    // Add inputs and constants.
+    for (size_t i = 0; i < input_nodes_.size(); ++i) {
+      auto nid = input_nodes_[i];
+      const auto& node = nodes_[nid];
+      std::string name = node.GetOpName();
+      if (node.GetOpType() == "input") {
+        builder.AddInput(nid, node);
+      } else {
+        CHECK_EQ(node.GetOpType(), "const");
+        uint32_t eid = EntryID(nid, 0);
+        builder.AddConstant(nid, data_entry_[eid]);
+      }
+    }
+
+    // Add layers.
+    for (size_t nid = 0; nid < nodes_.size(); ++nid) {
+      const auto& node = nodes_[nid];
+      if (node.GetOpType() != "kernel") continue;
+      builder.AddLayer(nid, node);
+    }
+
+    // Add outputs.
+    for (size_t i = 0; i < outputs_.size(); ++i) {
+      builder.AddOutput(outputs_[i]);
+    }
+
+    // Build engine.
+    trt_engine_cache_[symbol_name_] = builder.BuildEngine();
+    LOG(INFO) << "Finished building TensorRT engine for subgraph " << symbol_name_;
+  }
+
+  /*! \brief If TVM_TENSORRT_CACHE_DIR is set, will check that directory for
+   * already built TRT engines and load into trt_engine_cache_ so they don't
+   * have to be built at first inference.
+   */
+  bool GetCachedEnginesFromDisk() {

Review comment:
       Could you override the default SaveToBinary in the json runtime and optionally save the
engine if one exists (and/or based on a config option)? When LoadFromBinary is called, since you
have defined your own serialization method, you can check for the existence of the engine and load
it back. Essentially you have two different serialization/deserialization methods which you can
alternate between in LoadFromBinary/SaveToBinary.
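
       The alternating save/load scheme suggested here can be sketched standalone. This is a
hypothetical illustration using plain streams, not the real TVM module API (which serializes
through dmlc::Stream): a leading flag records which of the two payload layouts was written, and
the loader picks the matching deserialization path.

   ```cpp
   #include <cassert>
   #include <istream>
   #include <ostream>
   #include <sstream>
   #include <string>

   // Hypothetical module state: the JSON graph is always present, a prebuilt
   // engine blob only sometimes.
   struct TensorRTModuleState {
     std::string graph_json;
     std::string engine_blob;  // empty if no engine has been built yet

     // Mirrors an overridden SaveToBinary: write a flag, then one of the two
     // payload layouts.
     void SaveToStream(std::ostream& os) const {
       bool has_engine = !engine_blob.empty();
       os << (has_engine ? 1 : 0) << '\n' << graph_json << '\n';
       if (has_engine) os << engine_blob << '\n';
     }

     // Mirrors LoadFromBinary: check the flag and alternate between the two
     // deserialization paths.
     static TensorRTModuleState LoadFromStream(std::istream& is) {
       TensorRTModuleState state;
       int has_engine = 0;
       is >> has_engine;
       is.ignore();  // skip the newline after the flag
       std::getline(is, state.graph_json);
       if (has_engine) std::getline(is, state.engine_blob);
       return state;
     }
   };
   ```

       With this shape, the engine cache directory becomes optional: when an engine was serialized
it is restored directly, otherwise the loader falls back to the JSON-only path and builds the
engine at first inference as before.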




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

