rszper commented on code in PR #25226:
URL: https://github.com/apache/beam/pull/25226#discussion_r1092283439


##########
website/www/site/content/en/documentation/ml/overview.md:
##########
@@ -91,4 +91,5 @@ You can find examples of end-to-end AI/ML pipelines for 
several use cases:
 * [Online Clustering in Beam](/documentation/ml/online-clustering): 
Demonstrates how to set up a real-time clustering pipeline that can read text 
from Pub/Sub, convert the text into an embedding using a transformer-based 
language model with the RunInference API, and cluster the text using BIRCH with 
stateful processing.
 * [Anomaly Detection in Beam](/documentation/ml/anomaly-detection): 
Demonstrates how to set up an anomaly detection pipeline that reads text from 
Pub/Sub in real time and then detects anomalies using a trained HDBSCAN 
clustering model with the RunInference API.
 * [Large Language Model Inference in 
Beam](/documentation/ml/large-language-modeling): Demonstrates a pipeline that 
uses RunInference to perform translation with the T5 language model which 
contains 11 billion parameters.
-* [Per Entity Training in Beam](/documentation/ml/per-entity-training): 
Demonstrates a pipeline that trains a Decision Tree Classifier per education 
level for predicting if the salary of a person is >= 50k.
\ No newline at end of file
+* [Per Entity Training in Beam](/documentation/ml/per-entity-training): 
Demonstrates a pipeline that trains a Decision Tree Classifier per education 
level for predicting if the salary of a person is >= 50k.
+* [TensorRT Text Classification 
Inference](/documentation/ml/tensorrt-runinference): Demonstrates a pipeline to 
utilize TensorRT with the RunInference using a BERT-based text classification 
model.

Review Comment:
   ```suggestion
   * [TensorRT Text Classification 
Inference](/documentation/ml/tensorrt-runinference): Demonstrates a pipeline 
that uses TensorRT with the RunInference transform and a BERT-based text 
classification model.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,

Review Comment:
   ```suggestion
   for a text classification model. This pipeline reads in-memory data,
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process

Review Comment:
   ```suggestion
   from the text classification TensorRT engine. Next, it postprocesses
   ```
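
   For readers skimming this docstring, a minimal sketch of what the postprocessing step might look like (illustrative only, not the exact code in this PR; it assumes RunInference emits `PredictionResult` objects whose `inference` field holds the classifier logits for one example):
   ```
   import numpy as np
   import apache_beam as beam
   from apache_beam.ml.inference.base import PredictionResult


   class Postprocess(beam.DoFn):
     """Illustrative postprocessing: maps logits to a class label and prints it."""
     def __init__(self, tokenizer):
       self._tokenizer = tokenizer

     def process(self, element: PredictionResult):
       # element.example is assumed to hold the tokenized input IDs,
       # element.inference the raw logits returned by the TensorRT engine.
       input_ids = element.example
       logits = element.inference
       label = int(np.argmax(logits))  # for SST-2, 0 is negative, 1 is positive
       text = self._tokenizer.decode(input_ids, skip_special_tokens=True)
       print(f"{text} -> {label}")
   ```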



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.

Review Comment:
   ```suggestion
   - [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network to run inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput while preserving model accuracy by applying multiple optimizations, including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process
+the RunInference outputs to print the input and the predicted class label.
+It also prints different metrics provided by RunInference.
+"""
+
+import argparse
+import logging
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.ml.inference.base import RunInference
+from apache_beam.ml.inference.tensorrt_inference import 
TensorRTEngineHandlerNumPy
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.options.pipeline_options import SetupOptions
+from transformers import AutoTokenizer
+
+
+class Preprocess(beam.DoFn):
+  """Processes the input sentence to tokenize them.

Review Comment:
   ```suggestion
     """Processes the input sentences to tokenize them.
   ```
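
   For context, a tokenizing `Preprocess` DoFn along these lines is what the docstring describes (a sketch only; the padding length and dtype are assumptions, not necessarily what this example uses):
   ```
   import numpy as np
   import apache_beam as beam


   class Preprocess(beam.DoFn):
     """Illustrative preprocessing: tokenizes one sentence into NumPy input IDs."""
     def __init__(self, tokenizer):
       self._tokenizer = tokenizer

     def process(self, element: str):
       # Pad/truncate to a fixed length so that batches fed to the TensorRT
       # engine have a uniform shape.
       encoded = self._tokenizer(
           element, padding="max_length", truncation=True, max_length=128)
       yield np.array(encoded["input_ids"], dtype=np.int32)
   ```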



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process
+the RunInference outputs to print the input and the predicted class label.
+It also prints different metrics provided by RunInference.

Review Comment:
   ```suggestion
   It also prints metrics provided by RunInference.
   ```
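
   As a side note, those metrics can be read back from the `PipelineResult`; a hedged sketch (the counter name `num_inferences` is what the RunInference base implementation reports in recent Beam versions, but verify it against the version you run):
   ```
   import apache_beam as beam
   from apache_beam.metrics.metric import MetricsFilter

   pipeline = beam.Pipeline()
   _ = pipeline | beam.Create(["example"])  # stand-in for the real transforms

   result = pipeline.run()
   result.wait_until_finish()

   # Query counters emitted by RunInference, for example the inference count.
   metrics_filter = MetricsFilter().with_name("num_inferences")
   for counter in result.metrics().query(metrics_filter)["counters"]:
     print(counter)
   ```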



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.

Review Comment:
   ```suggestion
   To use TensorRT with Apache Beam, you need a TensorRT engine file converted from a trained model. We take a trained BERT-based text classification model that does sentiment analysis, that is, it classifies any text into two classes: positive or negative. The trained model is available [from Hugging Face](https://huggingface.co/textattack/bert-base-uncased-SST-2). To convert the PyTorch model to a TensorRT engine, you first convert the model to ONNX and then convert from ONNX to TensorRT.
   ```
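
   Not part of this suggestion, but a quick way to sanity-check the exported ONNX model before converting it to TensorRT is to run it once with `onnxruntime` (an extra dependency; the file name matches the ONNX export code quoted elsewhere in this PR):
   ```
   import numpy as np
   import onnxruntime as ort
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
   session = ort.InferenceSession("bert-sst2-model.onnx")

   encoded = tokenizer("Hello, my dog is cute", return_tensors="np")
   # Feed only the inputs that the exported graph declares.
   inputs = {i.name: encoded[i.name] for i in session.get_inputs()}
   logits = session.run(None, inputs)[0]
   print(int(np.argmax(logits)))  # for SST-2, 1 is the positive class
   ```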



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:

Review Comment:
   ```suggestion
   To convert an ONNX model to a TensorRT engine, use the following command from the CLI:
   ```
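
   For a concrete invocation matching the ONNX file exported earlier in this page (paths are illustrative), the same command can also be driven from Python:
   ```
   import subprocess

   # Illustrative paths; substitute your own ONNX model and engine locations.
   subprocess.run(
       [
           "trtexec",
           "--onnx=bert-sst2-model.onnx",
           "--saveEngine=bert-sst2-model.trt",
           "--useCudaGraph",
           "--verbose",
       ],
       check=True,
   )
   ```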



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT 
engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this 
[blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/)
 which builds a docker image built from a DockerFile. The dockerFile that we 
used is similar to it and can be found below:

Review Comment:
   ```suggestion
   To use `trtexec`, follow the steps in the blog post [Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/). The post explains how to build a Docker image from a Dockerfile. We use the following Dockerfile, which is similar to the one used in the blog post:
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference

Review Comment:
   ```suggestion
   # Use TensorRT with RunInference
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT 
engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this 
[blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/)
 which builds a docker image built from a DockerFile. The dockerFile that we 
used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine 
locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline

Review Comment:
   ```suggestion
   ## Run the TensorRT engine with RunInference in a Beam pipeline
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT 
engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this 
[blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/)
 which builds a docker image built from a DockerFile. The dockerFile that we 
used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine 
locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline
+
+Now that you have the TensorRT engine, you can use TensorRT engine with 
RunInferece in a Beam pipeline that can be run both locally and on GCP.

Review Comment:
   ```suggestion
   Now that you have the TensorRT engine, you can use it with RunInference in a Beam pipeline that can run both locally and on Google Cloud.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction

Review Comment:
   ```suggestion
   preprocesses the data, and then uses RunInference to generate predictions
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT 
engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this 
[blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/)
 which builds a docker image built from a DockerFile. The dockerFile that we 
used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine 
locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline
+
+Now that you have the TensorRT engine, you can use TensorRT engine with 
RunInferece in a Beam pipeline that can be run both locally and on GCP.
+
+The following code example is a part of the pipeline, where you use 
`TensorRTEngineHandlerNumPy` to load the TensorRT engine and set other 
inference parameters.
+
+```
+  model_handler = TensorRTEngineHandlerNumPy(
+      min_batch_size=1,
+      max_batch_size=1,
+      engine_path=known_args.trt_model_path,
+  )
+
+  task_sentences = [
+      "Hello, my dog is cute",
+      "I hate you",
+      "Shubham Krishna is a good coder",
+  ] * 4000
+
+  tokenizer = AutoTokenizer.from_pretrained(known_args.model_id)
+
+  with beam.Pipeline(options=pipeline_options) as pipeline:
+    _ = (
+        pipeline
+        | "CreateInputs" >> beam.Create(task_sentences)
+        | "Preprocess" >> beam.ParDo(Preprocess(tokenizer=tokenizer))
+        | "RunInference" >> RunInference(model_handler=model_handler)
+        | "PostProcess" >> beam.ParDo(Postprocess(tokenizer=tokenizer)))
+```
+
+The full code can be found [here]().

Review Comment:
   Link text (currently [here]) should be the title of the page being linked 
to. If it's code on GitHub, use [on GitHub].



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.

Review Comment:
   ```suggestion
   The following example demonstrates how to use TensorRT with the RunInference API and a BERT-based text classification model in a Beam pipeline.
   ```



##########
sdks/python/apache_beam/examples/inference/tensorrt_text_classification.py:
##########
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline to demonstrate usage of TensorRT with RunInference
+for a text classification model. This pipeline reads in memory data,
+does some preprocessing and then uses RunInference for getting prediction
+from the text classification TensorRT engine. Afterwards, it post process
+the RunInference outputs to print the input and the predicted class label.
+It also prints different metrics provided by RunInference.
+"""
+
+import argparse
+import logging
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.ml.inference.base import RunInference
+from apache_beam.ml.inference.tensorrt_inference import 
TensorRTEngineHandlerNumPy
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.options.pipeline_options import SetupOptions
+from transformers import AutoTokenizer
+
+
+class Preprocess(beam.DoFn):
+  """Processes the input sentence to tokenize them.
+
+  The input sentences are tokenized as the

Review Comment:
   ```suggestion
     The input sentences are tokenized because the
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.

Review Comment:
   ```suggestion
   You can use the Hugging Face `transformers` library to convert a PyTorch model to ONNX. For details, see the blog post [Convert Transformers to ONNX with Hugging Face Optimum](https://huggingface.co/blog/convert-transformers-to-onnx). The blog post explains which packages you need to install. The following code is used for the conversion.
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT 
engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this 
[blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/)
 which builds a docker image built from a DockerFile. The dockerFile that we 
used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine 
locally.

Review Comment:
   ```suggestion
   The blog post also includes instructions for testing the TensorRT engine locally.
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference

Review Comment:
   ```suggestion
   ## Build a TensorRT engine for inference
   ```



##########
website/www/site/content/en/documentation/ml/tensorrt-runinference.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "TensorRT RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Using TensorRT with RunInference
+- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) is an SDK that 
facilitates high-performance machine learning inference. It is designed to work 
with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It 
focuses specifically on optimizing and running a trained neural network for 
inference efficiently on NVIDIA GPUs. TensorRT can maximize inference 
throughput with multiple optimizations while preserving model accuracy 
including model quantization, layer and tensor fusions, kernel auto-tuning, 
multi-stream executions, and efficient tensor memory usage.
+
+- In Apache Beam 2.43.0, Beam introduced the `TensorRTEngineHandler`, which 
lets you deploy a TensorRT engine in a Beam pipeline. The RunInference 
transform simplifies the ML pipeline creation process by allowing developers to 
use Sklearn, PyTorch, TensorFlow and now TensorRT models in production 
pipelines without needing lots of boilerplate code.
+
+Below, you can find an example that demonstrates how to utilize TensorRT with 
the RunInference API using a BERT-based text classification model in a Beam 
pipeline.
+
+# Build a TensorRT engine for inference
+To use TensorRT with Apache Beam, you need a converted TensorRT engine file 
from a trained model. We take a trained BERT based text classification model 
that does sentiment analysis, i.e. classifies any text into two classes: 
positive or negative. The trained model is easily available 
[here](https://huggingface.co/textattack/bert-base-uncased-SST-2). In order to 
convert the PyTorch Model to TensorRT engine, you need to first convert the 
model to ONNX and then from ONNX to TensorRT.
+
+### Conversion to ONNX
+
+Using HuggingFace `transformers` libray, one can easily convert a PyTorch 
model to ONNX. A detailed blogpost can be found 
[here](https://huggingface.co/blog/convert-transformers-to-onnx) which also 
mentions the required packages to install. The code that we used for the 
conversion can be found below.
+
+```
+from pathlib import Path
+import transformers
+from transformers.onnx import FeaturesManager
+from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM, 
AutoModelForSequenceClassification
+
+
+# load model and tokenizer
+model_id = "textattack/bert-base-uncased-SST-2"
+feature = "sequence-classification"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# load config
+model_kind, model_onnx_config = 
FeaturesManager.check_supported_model_or_raise(model, feature=feature)
+onnx_config = model_onnx_config(model.config)
+
+# export
+onnx_inputs, onnx_outputs = transformers.onnx.export(
+        preprocessor=tokenizer,
+        model=model,
+        config=onnx_config,
+        opset=12,
+        output=Path("bert-sst2-model.onnx")
+)
+```
+
+### From ONNX to TensorRT engine
+
+In order to convert an ONNX model to a TensorRT engine you can use the 
following command from `CLI`:
+```
+trtexec --onnx=<path to onnx model> --saveEngine=<path to save TensorRT 
engine> --useCudaGraph --verbose
+```
+
+For using `trtexec`, you can follow this 
[blogpost](https://developer.nvidia.com/blog/simplifying-and-accelerating-machine-learning-predictions-in-apache-beam-with-nvidia-tensorrt/)
 which builds a docker image built from a DockerFile. The dockerFile that we 
used is similar to it and can be found below:
+
+```
+ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.05-py3
+
+FROM ${BUILD_IMAGE}
+
+ENV PATH="/usr/src/tensorrt/bin:${PATH}"
+
+WORKDIR /workspace
+
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository universe && \
+    apt-get update && \
+    apt-get install -y python3.8-venv
+
+RUN pip install --no-cache-dir apache-beam[gcp]==2.43.0
+COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
+
+RUN pip install --upgrade pip \
+    && pip install torch==1.13.1 \
+    && pip install torchvision>=0.8.2 \
+    && pip install pillow>=8.0.0 \
+    && pip install transformers>=4.18.0 \
+    && pip install cuda-python
+
+ENTRYPOINT [ "/opt/apache/beam/boot" ]
+```
+The blogpost also contains the instructions on how to test the TensorRT engine 
locally.
+
+
+## Running TensorRT Engine with RunInference in a Beam Pipeline
+
+Now that you have the TensorRT engine, you can use TensorRT engine with 
RunInferece in a Beam pipeline that can be run both locally and on GCP.
+
+The following code example is a part of the pipeline, where you use 
`TensorRTEngineHandlerNumPy` to load the TensorRT engine and set other 
inference parameters.

Review Comment:
   ```suggestion
   The following code example is part of the pipeline. You use `TensorRTEngineHandlerNumPy` to load the TensorRT engine and to set other inference parameters.
   ```


