andres-vv commented on code in PR #24656:
URL: https://github.com/apache/beam/pull/24656#discussion_r1080998539


##########
website/www/site/content/en/documentation/ml/multi-language-inference.md:
##########
@@ -0,0 +1,159 @@
+---
+title: "Cross Language RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Cross Language RunInference
+
+This Cross Language RunInference example shows how to use the [RunInference](https://beam.apache.org/documentation/ml/overview/#runinference)
+transform in a multi-language pipeline. The pipeline is written in Java and reads the input data from
+GCS. With the help of a [PythonExternalTransform](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms),
+a composite Python transform is called that does the preprocessing, inference, and postprocessing.
+Lastly, the data is written back to GCS by the Java pipeline.
+
+## NLP model and dataset
+A `bert-base-uncased` model is used to run inference. It is an open-source model
+available on [HuggingFace](https://huggingface.co/bert-base-uncased). This BERT model is
+used to predict the last word of a sentence, based on the context of the sentence.
+
+We also use an [IMDB movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv) dataset, which is an open-source dataset available on Kaggle. A sample of the data after preprocessing is shown below:
+
+| **Text** | **Last Word** |
+|--- |:--- |
+|<img width=700/>|<img width=100/>|
+| One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be [MASK] | hooked |
+| A wonderful little [MASK] | production |
+| So im not a big fan of Boll's work but then again not many [MASK] | are |
+| This a fantastic movie of three prisoners who become [MASK] | famous |
+| Some films just simply should not be [MASK] | remade |
+| The Karen Carpenter Story shows a little more about singer Karen Carpenter's complex [MASK] | life |
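
The last-word masking itself is simple string manipulation. As a minimal, self-contained sketch (this helper is illustrative and not taken from the example's code), the final word of each review can be split off and replaced with the `[MASK]` token:

```python
def mask_last_word(sentence: str) -> tuple[str, str]:
    """Replace the final word of a sentence with [MASK], returning the
    masked sentence together with the word that was masked out."""
    words = sentence.split()
    # The masked sentence becomes the model input; the removed word is the label.
    return " ".join(words[:-1] + ["[MASK]"]), words[-1]

masked, label = mask_last_word("A wonderful little production")
# masked == "A wonderful little [MASK]", label == "production"
```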
+
+The full code used in this example can be found on GitHub [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/multi_language_inference).
+
+
+## Multi-language RunInference pipeline
+### Cross-Language Python transform
+Besides running inference on the data, we also need to perform preprocessing and postprocessing, so that the pipeline produces clean output that is easily interpreted. To do these three tasks, a single composite custom PTransform is written, with a unit DoFn or PTransform for each of the tasks, as shown below:
+
+```python
+def expand(self, pcoll):
+    return (
+        pcoll
+        # Clean and tokenize the raw text for the BERT model.
+        | 'Preprocess' >> beam.ParDo(self.Preprocess(self._tokenizer))
+        # Run the model on the keyed, tokenized examples.
+        | 'Inference' >> RunInference(KeyedModelHandler(self._model_handler))
+        # Turn the predicted tokens back into readable text.
+        | 'Postprocess' >> beam.ParDo(self.Postprocess(
+            self._tokenizer)).with_input_types(typing.Iterable[str])
+    )
+```
+
+First, the preprocessing of the data is done: the raw textual data is cleaned and tokenized for the BERT model. All these steps are executed in the `Preprocess` DoFn. The `Preprocess` DoFn takes a single element as input and returns a list with the original text and the tokenized text.
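
Outside of Beam, the per-element logic of `Preprocess` can be sketched roughly as follows. The toy tokenizer and function names here are illustrative assumptions, not the example's actual code; the real example uses a HuggingFace BERT tokenizer inside a `beam.DoFn`:

```python
def toy_tokenize(text: str) -> list[int]:
    # Stand-in for a real BERT tokenizer: one dummy id per word.
    return [len(word) for word in text.split()]

def preprocess_element(text: str) -> tuple[str, list[int]]:
    """Pair the original text with its tokenized form, so that after
    inference the prediction can be matched back to the input text."""
    return (text, toy_tokenize(text))

key, tokens = preprocess_element("A wonderful little [MASK]")
# key is the original text; tokens is its (toy) tokenized form.
```

Keeping the original text alongside the tokens is what allows `RunInference` with a `KeyedModelHandler` to carry the key through inference untouched.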

Review Comment:
   I think you meant:
   First, the preprocessing of the data.....
   
   I will use that. Let me know if you meant something else.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
