This is an automated email from the ASF dual-hosted git repository.

damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 0bec597aabe Add TensorRT runinference example for Text Classification (#25226)
0bec597aabe is described below

commit 0bec597aabed951856553211e5bb7c5c8d5c4894
Author: Shubham Krishna <[email protected]>
AuthorDate: Wed Feb 8 15:45:21 2023 +0100

    Add TensorRT runinference example for Text Classification (#25226)
    
    * Add TensorRT runinference example for Text Classification
    
    * Add docstrings, fix linting, formatting
    
    * Fix pylinting for docstrings
    
    * Add partial documentation
    
    * Edit docstrings, documentation, reading of input
    
    * Fix python formatting
    
    * Add benchmarking results
    
    * Add benchmarking, remove print statment
    
    * Edit documentation
    
    * Update website/www/site/layouts/partials/section-menu/en/documentation.html
    
    ---------
    
    Co-authored-by: Shubham Krishna <[email protected]>
    Co-authored-by: Danny McCormick <[email protected]>
---
 .../inference/tensorrt_text_classification.py      | 126 +++++++++++++++++
 .../site/content/en/documentation/ml/overview.md   |   1 +
 .../en/documentation/ml/tensorrt-runinference.md   | 150 +++++++++++++++++++++
 .../partials/section-menu/en/documentation.html    |   1 +
 4 files changed, 278 insertions(+)

diff --git a/sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py b/sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py
new file mode 100644
index 00000000000..a5cda68fd79
--- /dev/null
+++ b/sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py
@@ -0,0 +1,126 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads data from a text
+file, preprocesses the data, and then uses RunInference to generate
+predictions from the text classification TensorRT engine. Next,
+it postprocesses the RunInference outputs to print the input and
+the predicted class label.
+It also prints metrics provided by RunInference.
+"""
+
+import argparse
+import logging
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.ml.inference.base import RunInference
+from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.options.pipeline_options import SetupOptions
+from transformers import AutoTokenizer
+
+
+class Preprocess(beam.DoFn):
+  """Processes the input sentences to tokenize them.
+
+  The input sentences are tokenized because the
+  model expects tokens.
+  """
+  def __init__(self, tokenizer: AutoTokenizer):
+    self._tokenizer = tokenizer
+
+  def process(self, element):
+    inputs = self._tokenizer(
+        element, return_tensors="np", padding="max_length", max_length=128)
+    return inputs.input_ids
+
+
+class Postprocess(beam.DoFn):
+  """Processes the PredictionResult to get the predicted class.
+
+  The logits are the output of the TensorRT engine.
+  We can get the class label by taking the index of
+  the maximum logit using argmax.
+  """
+  def __init__(self, tokenizer: AutoTokenizer):
+    self._tokenizer = tokenizer
+
+  def process(self, element):
+    decoded_input = self._tokenizer.decode(
+        element.example, skip_special_tokens=True)
+    logits = element.inference[0]
+    argmax = np.argmax(logits)
+    output = "Positive" if argmax == 1 else "Negative"
+    yield decoded_input, output
+
+
+def parse_known_args(argv):
+  """Parses args for the workflow."""
+  parser = argparse.ArgumentParser()
+  parser.add_argument(
+      '--input',
+      dest='input',
+      required=True,
+      help='Path to the text file containing sentences.')
+  parser.add_argument(
+      '--trt_model_path',
+      dest='trt_model_path',
+      required=True,
+      help='Path to the pre-built textattack/bert-base-uncased-SST-2 '
+      'TensorRT engine.')
+  parser.add_argument(
+      '--model_id',
+      dest='model_id',
+      default="textattack/bert-base-uncased-SST-2",
+      help="Name of the model.")
+  return parser.parse_known_args(argv)
+
+
+def run(
+    argv=None,
+    save_main_session=True,
+):
+  known_args, pipeline_args = parse_known_args(argv)
+  pipeline_options = PipelineOptions(pipeline_args)
+  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
+
+  model_handler = TensorRTEngineHandlerNumPy(
+      min_batch_size=1,
+      max_batch_size=1,
+      engine_path=known_args.trt_model_path,
+  )
+
+  tokenizer = AutoTokenizer.from_pretrained(known_args.model_id)
+
+  with beam.Pipeline(options=pipeline_options) as pipeline:
+    _ = (
+        pipeline
+        | "ReadSentences" >> beam.io.ReadFromText(known_args.input)
+        | "Preprocess" >> beam.ParDo(Preprocess(tokenizer=tokenizer))
+        | "RunInference" >> RunInference(model_handler=model_handler)
+        | "PostProcess" >> beam.ParDo(Postprocess(tokenizer=tokenizer))
+        | "LogResult" >> beam.Map(logging.info))
+  metrics = pipeline.result.metrics().query(beam.metrics.MetricsFilter())
+  logging.info(metrics)
+
+
+if __name__ == '__main__':
+  logging.getLogger().setLevel(logging.INFO)
+  run()
diff --git a/website/www/site/content/en/documentation/ml/overview.md b/website/www/site/content/en/documentation/ml/overview.md
index 5d24558987e..dabe7e9629f 100644
--- a/website/www/site/content/en/documentation/ml/overview.md
+++ b/website/www/site/content/en/documentation/ml/overview.md
@@ -94,3 +94,4 @@ You can find examples of end-to-end AI/ML pipelines for several use cases:
 * [Anomaly Detection in Beam](/documentation/ml/anomaly-detection): Demonstrates how to set up an anomaly detection pipeline that reads text from Pub/Sub in real time and then detects anomalies using a trained HDBSCAN clustering model with the RunInference API.
 * [Large Language Model Inference in Beam](/documentation/ml/large-language-modeling): Demonstrates a pipeline that uses RunInference to perform translation with the T5 language model which contains 11 billion parameters.
 * [Per Entity Training in Beam](/documentation/ml/per-entity-training): Demonstrates a pipeline that trains a Decision Tree Classifier per education level for predicting if the salary of a person is >= 50k.
+* [TensorRT Inference](/documentation/ml/tensorrt-runinference): Demonstrates a pipeline that uses TensorRT with the RunInference transform and a BERT-based text classification model.
diff --git a/website/www/site/content/en/documentation/ml/tensorrt-runinference.md b/website/www/site/content/en/documentation/ml/tensorrt-runinference.md
new file mode 100644
index 00000000000..e2825f7fe9c
--- /dev/null
+++ b/website/www/site/content/en/documentation/ml/tensorrt-runinference.md
@@ -0,0 +1,150 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Use TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network to efficiently run inference on NVIDIA GPUs. TensorRT can maximize inference throughput with multiple optimizations while preserving model accuracy, including model quantization, layer and tenso [...]
+
+- In Apache Beam 2.43.0, Beam introduced the [TensorRTEngineHandler](https://beam.apache.org/releases/pydoc/2.43.0/apache_beam.ml.inference.tensorrt_inference.html#apache_beam.ml.inference.tensorrt_inference.TensorRTEngineHandlerNumPy), which lets you deploy a TensorRT engine in a Beam pipeline. The RunInference transform simplifies the ML inference pipeline creation process by allowing developers to use Sklearn, PyTorch, TensorFlow and now TensorRT models in production pipelines without [...]
+
+The following example demonstrates how to use TensorRT with the RunInference API and a BERT-based text classification model in a Beam pipeline.
+
+## Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a TensorRT engine file converted from a trained model. This example uses a trained BERT-based text classification model that performs sentiment analysis, classifying any text as positive or negative. The trained model is available [from HuggingFace](https://huggingface.co/textattack/bert-base-uncased-SST-2). To convert the PyTorch model to a TensorRT engine, first convert the model to ONNX, and then convert from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+You can use the HuggingFace `transformers` library to convert a PyTorch model to ONNX. For details, including which packages to install, see the blog post [Convert Transformers to ONNX with Hugging Face Optimum](https://huggingface.co/blog/convert-transformers-to-onnx). The following code performs the conversion.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
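The tokenizer in this example pads every sentence to a fixed length of 128 tokens (`padding="max_length", max_length=128` in the `Preprocess` DoFn), which matches the fixed input shape the converted engine expects. A minimal plain-Python sketch of that fixed-length padding, assuming the BERT convention of pad id 0 (the function name and pad id here are illustrative, not part of the example code):

```python
# Sketch of the fixed-length padding the pipeline relies on. The real
# pipeline delegates this to the HuggingFace tokenizer; this version
# only illustrates the shape contract with the TensorRT engine.
MAX_LENGTH = 128
PAD_ID = 0  # BERT pad token id; an assumption for this illustration

def pad_to_max(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Pad (or truncate) a list of token ids to a fixed length."""
    ids = token_ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

print(len(pad_to_max([101, 2023, 102])))  # prints 128
```

Every batch handed to the engine therefore has the same shape, which is what allows a TensorRT engine built with a static input profile to serve it.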
+
+### From ONNX to TensorRT engine
+
+To convert an ONNX model to a TensorRT engine, run the following command from the CLI:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT engine> --useCudaGraph --verbose
+```
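If you convert several models, it can help to assemble the command programmatically instead of editing it by hand. A minimal sketch (the helper name `build_trtexec_cmd` is hypothetical; only the flags from the command above are used):

```python
import shlex

def build_trtexec_cmd(onnx_path, engine_path):
    """Assemble the trtexec conversion command as an argv list.

    A scripting convenience for this guide, not part of the example
    code; pass the result to subprocess.run inside the TensorRT
    container.
    """
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--useCudaGraph",
        "--verbose",
    ]

cmd = build_trtexec_cmd("bert-sst2-model.onnx", "sst2-text-classification.trt")
print(shlex.join(cmd))
```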
+
+To use `trtexec`, follow the steps in the blog post [Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/). The post explains how to build a Docker image from a Dockerfile that can be used for the conversion. This example uses the following Dockerfile, which is similar to the file used in the blog post:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update -y && apt-get install -y python3-venv
+RUN pip install --no-cache-dir apache-beam[gcp]==2.44.0
+COPY --from=apache/beam_python3.8_sdk:2.44.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blog post also explains how to test the TensorRT engine locally.
+
+
+## Run TensorRT engine with RunInference in a Beam pipeline
+
+Now that you have the TensorRT engine, you can use it with RunInference in a Beam pipeline that can run both locally and on Google Cloud.
+
+The following code example is part of the pipeline. You use `TensorRTEngineHandlerNumPy` to load the TensorRT engine and to set other inference parameters.
+
+```
+  model_handler = TensorRTEngineHandlerNumPy(
+      min_batch_size=1,
+      max_batch_size=1,
+      engine_path=known_args.trt_model_path,
+  )
+
+  tokenizer = AutoTokenizer.from_pretrained(known_args.model_id)
+
+  with beam.Pipeline(options=pipeline_options) as pipeline:
+    _ = (
+        pipeline
+        | "ReadSentences" >> beam.io.ReadFromText(known_args.input)
+        | "Preprocess" >> beam.ParDo(Preprocess(tokenizer=tokenizer))
+        | "RunInference" >> RunInference(model_handler=model_handler)
+        | "PostProcess" >> beam.ParDo(Postprocess(tokenizer=tokenizer)))
+```
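The `Preprocess` and `Postprocess` DoFns used above are defined in the example file: `Preprocess` tokenizes each sentence, and `Postprocess` decodes the input and takes the argmax over the two output logits. The label decision itself can be sketched in plain Python, without a Beam dependency (the function name is illustrative; index 1 is the positive class for `textattack/bert-base-uncased-SST-2`, as in the example):

```python
def label_from_logits(logits):
    """Map [negative, positive] logits to a class label.

    Plain-Python sketch of the example's Postprocess step: the
    predicted class is the index of the maximum logit.
    """
    argmax = max(range(len(logits)), key=lambda i: logits[i])
    return "Positive" if argmax == 1 else "Negative"

print(label_from_logits([-1.2, 3.4]))  # prints Positive
```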
+
+The full code can be found [on GitHub](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py).
+
+To run this job on Dataflow, run the following command locally:
+
+```
+python tensorrt_text_classification.py \
+--input gs://{GCP_PROJECT}/sentences.txt \
+--trt_model_path gs://{GCP_PROJECT}/sst2-text-classification.trt \
+--runner DataflowRunner \
+--experiment=use_runner_v2 \
+--machine_type=n1-standard-4 \
+--experiment="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver" \
+--disk_size_gb=75 \
+--project {GCP_PROJECT} \
+--region us-central1 \
+--temp_location gs://{GCP_PROJECT}/tmp/ \
+--job_name tensorrt-text-classification \
+--sdk_container_image="us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt:tensor_rt"
+```
+
+## Dataflow Benchmarking
+
+We ran experiments in Dataflow using a TensorRT engine and the following configuration: an `n1-standard-4` machine with a disk size of 75 GB. To mimic data streaming into Dataflow via Pub/Sub, we set the batch size to 1 by setting both the min and max batch sizes of the model handler to 1.
+
+| Configuration | Stage with RunInference | Mean inference_batch_latency_micro_secs |
+|:----------:|:----------:|:----------:|
+| TensorFlow with T4 GPU | 3 min 1 sec | 15,176 |
+| TensorRT with T4 GPU | 45 sec | 3,685 |
+
+The Dataflow runner decomposes a pipeline into multiple stages. To get a better picture of RunInference performance, look at the stage that contains the inference call, not the stages that read and write data. The time for this stage is shown in the **Stage with RunInference** column.
+
+The metric `inference_batch_latency_micro_secs` is the time, in microseconds, that it takes to perform the inference on a batch of examples, that is, the time to call `model_handler.run_inference`. This value varies over time depending on the dynamic batching decisions of `BatchElements` and on the particular values and dtypes of the elements. For this metric, TensorRT is about 4.1x faster than TensorFlow.
\ No newline at end of file
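The 4.1x figure above follows directly from the benchmark table; a quick check of the arithmetic:

```python
# Mean inference_batch_latency_micro_secs from the Dataflow benchmark table.
tensorflow_latency_us = 15_176
tensorrt_latency_us = 3_685

speedup = tensorflow_latency_us / tensorrt_latency_us
print(f"TensorRT is about {speedup:.1f}x faster than TensorFlow")
# prints: TensorRT is about 4.1x faster than TensorFlow
```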
diff --git a/website/www/site/layouts/partials/section-menu/en/documentation.html b/website/www/site/layouts/partials/section-menu/en/documentation.html
index a46794e73f3..f0046693cfc 100644
--- a/website/www/site/layouts/partials/section-menu/en/documentation.html
+++ b/website/www/site/layouts/partials/section-menu/en/documentation.html
@@ -225,6 +225,7 @@
     <li><a href="/documentation/ml/anomaly-detection/">Anomaly Detection</a></li>
     <li><a href="/documentation/ml/large-language-modeling">Large Language Model Inference in Beam</a></li>
     <li><a href="/documentation/ml/per-entity-training">Per Entity Training in Beam</a></li>
+    <li><a href="/documentation/ml/tensorrt-runinference">TensorRT Inference</a></li>
   </ul>
 </li>
 <li class="section-nav-item--collapsible">
