[GitHub] [beam] yeandy commented on a diff in pull request #22069: Reviewing the RunInference ReadMe file for clarity.

GitBox Fri, 15 Jul 2022 08:19:32 -0700


yeandy commented on code in PR #22069:
URL: https://github.com/apache/beam/pull/22069#discussion_r918927434



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -32,55 +32,70 @@ pip install apache-beam==2.40.0
 
 ### PyTorch dependencies
 
+The following installation requirements are for the files used in these 
examples.
+
 The RunInference API supports the PyTorch framework. To use PyTorch locally, 
first install `torch`.
 ```
-pip install torch==1.11.0
+pip install torch==1.10.0
 ```
 
-If you are using pretrained models from Pytorch's `torchvision.models` 
[subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights),
 you may also need to install `torchvision`.
+If you are using pretrained models from Pytorch's `torchvision.models` 
[subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights),
 you also need to install `torchvision`.
 ```
 pip install torchvision
 ```
 
-If you are using pretrained models from Hugging Face's `transformers` 
[package](https://huggingface.co/docs/transformers/index), you may also need to 
install `transformers`.
+If you are using pretrained models from Hugging Face's `transformers` 
[package](https://huggingface.co/docs/transformers/index), you also need to 
install `transformers`.
 ```
 pip install transformers
 ```
 
-For installation of the `torch` dependency on a distributed runner, like 
Dataflow, refer to these 
[instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).
+For installation of the `torch` dependency on a distributed runner such as 
Dataflow, refer to the 
+[PyPI dependency 
instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).
+
+RunInference uses dynamic batching. However, the RunInference API cannot batch 
tensor elements of different sizes, because `torch.stack()` expects tensors of 
the same length. If you provide images of different sizes or word embeddings of 
different lengths, errors might occur.
+
+To avoid this issue:
+
+1. Either use elements that have the same size, or resize image inputs and 
word embeddings to make them 
+the same size. Depending on the language model and encoding technique, this 
option might not be available. 
+2. Disable batching by overriding the `batch_elements_kwargs` function in your 
ModelHandler and setting the maximum batch size (`max_batch_size`) to one: 
`max_batch_size=1`. For more information, see BatchElements PTransforms.
 
 <!---
 TODO: Add link to full documentation on Beam website when it's published.
 
-i.e. "See the
-[documentation](https://beam.apache.org/documentation/dsls/dataframes/overview/#pre-requisites)
-for details."
+i.e. "For more information, see the
+[Machine 
Learning](https://beam.apache.org/documentation/sdks/python-machine-learning/) 
documentation."
+
+Also relevant: 
https://beam.apache.org/documentation/transforms/python/elementwise/runinference/
 -->
 
 ---
 ## Image classification
 
-[`pytorch_image_classification.py`](./pytorch_image_classification.py) 
contains an implementation for a RunInference pipeline that performs image 
classification using the mobilenet_v2 architecture.
+[`pytorch_image_classification.py`](./pytorch_image_classification.py) 
contains an implementation for a RunInference pipeline that performs image 
classification using the `mobilenet_v2` architecture.
 
-The pipeline reads the images, performs basic preprocessing, passes them to 
the PyTorch implementation of RunInference, and then writes the predictions to 
a text file.
+The pipeline reads the images, performs basic preprocessing, passes the images 
to the PyTorch implementation of RunInference, and then writes the predictions 
to a text file.
 
 ### Dataset and model for image classification
 
-You will need to create or download images, and place them into your 
`IMAGES_DIR` directory. One popular dataset is from 
[ImageNet](https://www.image-net.org/). Please follow their instructions to 
download the images.
-- **Required**: A path to a file called `IMAGE_FILE_NAMES` that contains the 
absolute paths of each of the images in `IMAGES_DIR` on which you want to run 
image segmentation. Paths can be different types of URIs such as your local 
file system, a AWS S3 bucket or GCP Cloud Storage bucket. For example:
+To use this transform, you need a dataset and model for image classification.
+
+1. Create a directory named `IMAGES_DIR`. Create or download images and put 
them in this directory. The directory is not required if image names in the 
input file `IMAGE_FILE_NAMES` have absolute paths.
+One popular dataset is from [ImageNet](https://www.image-net.org/). Follow 
their instructions to download the images.
+2. Create a file named `IMAGE_FILE_NAMES` that contains the absolute paths of 
each of the images in `IMAGES_DIR` that you want to use to run image 
classification. The path to the file can be different types of URIs such as 
your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For 
example:
 ```
 /absolute/path/to/image1.jpg
 /absolute/path/to/image2.jpg
 ```
-- **Required**: A path to a file called `MODEL_STATE_DICT` that contains the 
saved parameters of the maskrcnn_resnet50_fpn model. You will need to download 
the 
[mobilenet_v2](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv2.html)
 model from Pytorch's repository of pretrained models. Note that this requires 
`torchvision` library.
+3. Download the 
[mobilenet_v2](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv2.html)
 model from Pytorch's repository of pretrained models. This model requires the 
torchvision library. To download this model, run the following commands:
 ```
 import torch
 from torchvision.models.detection import mobilenet_v2

Review Comment:
   I have a typo here. Sorry about that!
   ```suggestion
   from torchvision.models import mobilenet_v2
   ```



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -157,26 +163,30 @@ Each line has data separated by a semicolon ";". The 
first item is the file name
 ---
 ## Language modeling
 
-[`pytorch_language_modeling.py`](./pytorch_language_modeling.py) contains an 
implementation for a RunInference pipeline that performs masked language 
modeling (i.e. decoding a masked token in a sentence) using the BertForMaskedLM 
architecture from Hugging Face.
+[`pytorch_language_modeling.py`](./pytorch_language_modeling.py) contains an 
implementation for a RunInference pipeline that performs masked language 
modeling (that is, decoding a masked token in a sentence) using the 
`BertForMaskedLM` architecture from Hugging Face.
 
 The pipeline reads sentences, performs basic preprocessing to convert the last 
word into a `[MASK]` token, passes the masked sentence to the PyTorch 
implementation of RunInference, and then writes the predictions to a text file.
 
 ### Dataset and model for language modeling
 
-- **Required**: A path to a file called `MODEL_STATE_DICT` that contains the 
saved parameters of the BertForMaskedLM model. You will need to download the 
[BertForMaskedLM](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM)
 model from Hugging Face's repository of pretrained models. Make sure you have 
installed `transformers` too.
+To use this transform, you need a dataset and model for language modeling. 

Review Comment:
   ```suggestion
   To use this transform, you need a dataset and model for language modeling.
   ```



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -32,55 +32,57 @@ pip install apache-beam==2.40.0
 
 ### PyTorch dependencies
 
+The following installation requirements are for the files used in these 
examples.
+
 The RunInference API supports the PyTorch framework. To use PyTorch locally, 
first install `torch`.
 ```
-pip install torch==1.11.0
+pip install torch==1.10.0
 ```
 
-If you are using pretrained models from Pytorch's `torchvision.models` 
[subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights),
 you may also need to install `torchvision`.
+If you are using pretrained models from Pytorch's `torchvision.models` 
[subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights),
 you also need to install `torchvision`.
 ```
 pip install torchvision
 ```
 
-If you are using pretrained models from Hugging Face's `transformers` 
[package](https://huggingface.co/docs/transformers/index), you may also need to 
install `transformers`.
+If you are using pretrained models from Hugging Face's `transformers` 
[package](https://huggingface.co/docs/transformers/index), you also need to 
install `transformers`.
 ```
 pip install transformers
 ```
 
-For installation of the `torch` dependency on a distributed runner, like 
Dataflow, refer to these 
[instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).
-
-<!---
-TODO: Add link to full documentation on Beam website when it's published.
+For installation of the `torch` dependency on a distributed runner such as 
Dataflow, refer to the 

Review Comment:
   ```suggestion
   For installation of the `torch` dependency on a distributed runner such as 
Dataflow, refer to the
   ```



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -32,55 +32,70 @@ pip install apache-beam==2.40.0
 
 ### PyTorch dependencies
 
+The following installation requirements are for the files used in these 
examples.
+
 The RunInference API supports the PyTorch framework. To use PyTorch locally, 
first install `torch`.
 ```
-pip install torch==1.11.0
+pip install torch==1.10.0
 ```
 
-If you are using pretrained models from Pytorch's `torchvision.models` 
[subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights),
 you may also need to install `torchvision`.
+If you are using pretrained models from Pytorch's `torchvision.models` 
[subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights),
 you also need to install `torchvision`.
 ```
 pip install torchvision
 ```
 
-If you are using pretrained models from Hugging Face's `transformers` 
[package](https://huggingface.co/docs/transformers/index), you may also need to 
install `transformers`.
+If you are using pretrained models from Hugging Face's `transformers` 
[package](https://huggingface.co/docs/transformers/index), you also need to 
install `transformers`.
 ```
 pip install transformers
 ```
 
-For installation of the `torch` dependency on a distributed runner, like 
Dataflow, refer to these 
[instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).
+For installation of the `torch` dependency on a distributed runner such as 
Dataflow, refer to the 
+[PyPI dependency 
instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).
+
+RunInference uses dynamic batching. However, the RunInference API cannot batch 
tensor elements of different sizes, because `torch.stack()` expects tensors of 
the same length. If you provide images of different sizes or word embeddings of 
different lengths, errors might occur.
+
+To avoid this issue:
+
+1. Either use elements that have the same size, or resize image inputs and 
word embeddings to make them 
+the same size. Depending on the language model and encoding technique, this 
option might not be available. 
+2. Disable batching by overriding the `batch_elements_kwargs` function in your 
ModelHandler and setting the maximum batch size (`max_batch_size`) to one: 
`max_batch_size=1`. For more information, see BatchElements PTransforms.

Review Comment:
   I think this should have its own subheader; it doesn't belong in the 
`Dependencies` section. Could we call it something like `Usage Notes` or 
`Notes`, and put it at the same header level as `Prerequisites` 
(i.e. `## Usage Notes`)?
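
   For reference, option 2 in the hunk above (disabling batching) could be 
sketched roughly as follows. This is a minimal sketch under stated assumptions: 
`SingleElementBatchMixin` is a hypothetical name that is not part of Beam, and 
the returned dictionary is assumed to be passed as keyword arguments to 
`BatchElements`. Only `batch_elements_kwargs` and `max_batch_size` come from 
the README text.

```python
from typing import Any, Dict


class SingleElementBatchMixin:
    """Hypothetical mix-in that overrides batch_elements_kwargs."""

    def batch_elements_kwargs(self) -> Dict[str, Any]:
        # Assumption: RunInference forwards these kwargs to BatchElements,
        # so a maximum batch size of one yields one element per batch.
        return {"max_batch_size": 1}


# Usage sketch (not executed here; SomeModelHandler stands in for a real
# ModelHandler subclass and is a placeholder name):
#   class UnbatchedHandler(SingleElementBatchMixin, SomeModelHandler):
#       pass
kwargs = SingleElementBatchMixin().batch_elements_kwargs()
print(kwargs)
```

   Listing the mix-in first in the base classes lets its override take 
precedence over the handler's default, without changing how the model is 
loaded or run.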



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -108,27 +110,31 @@ This writes the output to the `predictions.csv` with 
contents like:
 ---
 ## Image segmentation
 
-[`pytorch_image_segmentation.py`](./pytorch_image_segmentation.py) contains an 
implementation for a RunInference pipeline that performs image segementation 
using the maskrcnn_resnet50_fpn architecture.
+[`pytorch_image_segmentation.py`](./pytorch_image_segmentation.py) contains an 
implementation for a RunInference pipeline that performs image segementation 
using the `maskrcnn_resnet50_fpn` architecture.
 
-The pipeline reads images, performs basic preprocessing, passes them to the 
PyTorch implementation of RunInference, and then writes the predictions to a 
text file.
+The pipeline reads images, performs basic preprocessing, passes the images to 
the PyTorch implementation of RunInference, and then writes predictions to a 
text file.
 
 ### Dataset and model for image segmentation
-You will need to create or download images, and place them into your 
`IMAGES_DIR` directory. Another popular dataset is from 
[Coco](https://cocodataset.org/#home). Please follow their instructions to 
download the images.
-- **Required**: A path to a file called `IMAGE_FILE_NAMES` that contains the 
absolute paths of each of the images in `IMAGES_DIR` on which you want to run 
image segmentation. Paths can be different types of URIs such as your local 
file system, a AWS S3 bucket or GCP Cloud Storage bucket. For example:
+
+To use this transform, you need a dataset and model for image segmentation. 

Review Comment:
   ```suggestion
   To use this transform, you need a dataset and model for image segmentation.
   ```



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -108,27 +110,31 @@ This writes the output to the `predictions.csv` with 
contents like:
 ---
 ## Image segmentation
 
-[`pytorch_image_segmentation.py`](./pytorch_image_segmentation.py) contains an 
implementation for a RunInference pipeline that performs image segementation 
using the maskrcnn_resnet50_fpn architecture.
+[`pytorch_image_segmentation.py`](./pytorch_image_segmentation.py) contains an 
implementation for a RunInference pipeline that performs image segementation 
using the `maskrcnn_resnet50_fpn` architecture.
 
-The pipeline reads images, performs basic preprocessing, passes them to the 
PyTorch implementation of RunInference, and then writes the predictions to a 
text file.
+The pipeline reads images, performs basic preprocessing, passes the images to 
the PyTorch implementation of RunInference, and then writes predictions to a 
text file.
 
 ### Dataset and model for image segmentation
-You will need to create or download images, and place them into your 
`IMAGES_DIR` directory. Another popular dataset is from 
[Coco](https://cocodataset.org/#home). Please follow their instructions to 
download the images.
-- **Required**: A path to a file called `IMAGE_FILE_NAMES` that contains the 
absolute paths of each of the images in `IMAGES_DIR` on which you want to run 
image segmentation. Paths can be different types of URIs such as your local 
file system, a AWS S3 bucket or GCP Cloud Storage bucket. For example:
+
+To use this transform, you need a dataset and model for image segmentation. 
+
+1. Create a directory named `IMAGES_DIR`. Create or download images and put 
them in this directory. The directory is not required if image names in the 
input file `IMAGE_FILE_NAMES` have absolute paths.
+A popular dataset is from [Coco](https://cocodataset.org/#home). Follow their 
instructions to download the images.
+2. Create a file named `IMAGE_FILE_NAMES` that contains the absolute paths of 
each of the images in `IMAGES_DIR` that you want to use to run image 
segmentation. The path to the file can be different types of URIs such as your 
local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
 ```
 /absolute/path/to/image1.jpg
 /absolute/path/to/image2.jpg
 ```
-- **Required**: A path to a file called `MODEL_STATE_DICT` that contains the 
saved parameters of the maskrcnn_resnet50_fpn model. You will need to download 
the [maskrcnn_resnet50_fpn](https://pytorch.org/vision/0.12/models.html#id70)
-model from Pytorch's repository of pretrained models. Note that this requires 
`torchvision` library.
+3. Download the 
[maskrcnn_resnet50_fpn](https://pytorch.org/vision/0.12/models.html#id70) model 
from Pytorch's repository of pretrained models. This model requires the 
torchvision library. To download this model, run the following commands:
 ```
 import torch
 from torchvision.models.detection import maskrcnn_resnet50_fpn
 model = maskrcnn_resnet50_fpn(pretrained=True)
 torch.save(model.state_dict(), 'maskrcnn_resnet50_fpn.pth')
 ```
-- **Required**: A path to a file called `OUTPUT`, to which the pipeline will 
write the predictions.
-- **Optional**: `IMAGES_DIR`, which is the path to the directory where images 
are stored. Not required if image names in the input file `IMAGE_FILE_NAMES` 
have absolute paths.
+4. Create a path to a file named `MODEL_STATE_DICT` that contains the saved 
parameters of the `maskrcnn_resnet50_fpn` model. 

Review Comment:
   ```suggestion
   4. Create a path to a file named `MODEL_STATE_DICT` that contains the saved 
parameters of the `maskrcnn_resnet50_fpn` model.
   ```



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -157,26 +163,30 @@ Each line has data separated by a semicolon ";". The 
first item is the file name
 ---
 ## Language modeling
 
-[`pytorch_language_modeling.py`](./pytorch_language_modeling.py) contains an 
implementation for a RunInference pipeline that performs masked language 
modeling (i.e. decoding a masked token in a sentence) using the BertForMaskedLM 
architecture from Hugging Face.
+[`pytorch_language_modeling.py`](./pytorch_language_modeling.py) contains an 
implementation for a RunInference pipeline that performs masked language 
modeling (that is, decoding a masked token in a sentence) using the 
`BertForMaskedLM` architecture from Hugging Face.
 
 The pipeline reads sentences, performs basic preprocessing to convert the last 
word into a `[MASK]` token, passes the masked sentence to the PyTorch 
implementation of RunInference, and then writes the predictions to a text file.
 
 ### Dataset and model for language modeling
 
-- **Required**: A path to a file called `MODEL_STATE_DICT` that contains the 
saved parameters of the BertForMaskedLM model. You will need to download the 
[BertForMaskedLM](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM)
 model from Hugging Face's repository of pretrained models. Make sure you have 
installed `transformers` too.
+To use this transform, you need a dataset and model for language modeling. 
+
+1. Download the 
[BertForMaskedLM](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM)
 model from Hugging Face's repository of pretrained models. You must already 
have `transformers` installed.
 ```
 import torch
 from transformers import BertForMaskedLM
 model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
 torch.save(model.state_dict(), 'BertForMaskedLM.pth')
 ```
-- **Required**: A path to a file called `OUTPUT`, to which the pipeline will 
write the predictions.
-- **Optional**: A path to a file called `SENTENCES` that contains sentences to 
feed into the model. It should look something like this:
+2. Create a file named `MODEL_STATE_DICT` that contains the saved parameters 
of the `BertForMaskedLM` model. 

Review Comment:
   ```suggestion
   2. Create a file named `MODEL_STATE_DICT` that contains the saved parameters 
of the `BertForMaskedLM` model.
   ```



##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -218,16 +228,19 @@ is the word that the model predicts for the mask.
 The pipeline reads rows of pixels corresponding to a digit, performs basic 
preprocessing, passes the pixels to the Scikit-learn implementation of 
RunInference, and then writes the predictions to a text file.
 
 ### Dataset and model for language modeling
-- **Required**: A path to a file called `INPUT` that contains label and pixels 
to feed into the model. Each row should have elements that are comma-separated. 
The first element is the label. All subsuequent elements would be pixel values. 
It should look something like this:
+
+To use this transform, you need a dataset and model for language modeling. 

Review Comment:
   ```suggestion
   To use this transform, you need a dataset and model for language modeling.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

