This is an automated email from the ASF dual-hosted git repository.
damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 2e71061e7b9 ML notebook formatting and text updates (#24437)
2e71061e7b9 is described below
commit 2e71061e7b9d2383cbea5531215b68e6ec0236cd
Author: Rebecca Szper <[email protected]>
AuthorDate: Thu Dec 1 06:13:40 2022 -0800
ML notebook formatting and text updates (#24437)
* merged and resolved the conflict
* more copy edits to the ML notebooks
* merged and resolved the conflict
* more copy edits to the ML notebooks
* more copy edits to the ML notebooks
* more copy edits to the ML notebooks
* trying to remove a section that shouldn't have been added back in
* Update examples/notebooks/beam-ml/custom_remote_inference.ipynb
Co-authored-by: Danny McCormick <[email protected]>
* Update examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
Co-authored-by: Danny McCormick <[email protected]>
* review updates
Co-authored-by: Danny McCormick <[email protected]>
---
.../beam-ml/custom_remote_inference.ipynb | 50 +++++++------
.../beam-ml/dataframe_api_preprocessing.ipynb | 82 ++++++++++------------
.../notebooks/beam-ml/run_custom_inference.ipynb | 17 ++---
.../beam-ml/run_inference_multi_model.ipynb | 74 ++++++++++---------
.../notebooks/beam-ml/run_inference_pytorch.ipynb | 32 +++++----
.../run_inference_pytorch_tensorflow_sklearn.ipynb | 57 +++++++--------
.../notebooks/beam-ml/run_inference_sklearn.ipynb | 30 ++++----
.../beam-ml/run_inference_tensorflow.ipynb | 42 +++++++----
8 files changed, 197 insertions(+), 187 deletions(-)
diff --git a/examples/notebooks/beam-ml/custom_remote_inference.ipynb
b/examples/notebooks/beam-ml/custom_remote_inference.ipynb
index 036a9d39d4e..ad25849e89e 100644
--- a/examples/notebooks/beam-ml/custom_remote_inference.ipynb
+++ b/examples/notebooks/beam-ml/custom_remote_inference.ipynb
@@ -4,6 +4,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
+ "cellView": "form",
"id": "paYiulysGrwR"
},
"outputs": [],
@@ -36,15 +37,16 @@
"source": [
"# Remote inference in Apache Beam\n",
"\n",
+ "This example demonstrates how to implement a custom inference call in
Apache Beam using the Google Cloud Vision API.\n",
+ "\n",
"The preferred way to run inference in Apache Beam is by using the
[RunInference
API](https://beam.apache.org/documentation/sdks/python-machine-learning/). \n",
- "The RunInference API enables you to run your models as part of your
pipeline in a way that is optimized for machine learning inference. \n",
+ "The RunInference API enables you to run models as part of your
pipeline in a way that is optimized for machine learning inference. \n",
"To reduce the number of steps that you need to take, RunInference
supports features like batching. For more information about the RunInference
API, review the [RunInference
API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.html#apache_beam.ml.inference.RunInference),
\n",
"which demonstrates how to implement model inference in PyTorch,
scikit-learn, and TensorFlow.\n",
"\n",
"Currently, the RunInference API doesn't support making remote
inference calls using the Natural Language API, Cloud Vision API, and so on.
\n",
- "Therefore, to use these remote APIs with Apache Beam, you need to
write custom inference calls.\n",
- "\n",
- "This notebook shows how to implement a custom inference call in
Apache Beam. This example uses the Google Cloud Vision API."
+ "Therefore, to use these remote APIs with Apache Beam, you need to
write custom inference calls.\n"
+
]
},
{
@@ -53,7 +55,7 @@
"id": "GNbarEZsalS1"
},
"source": [
- "## Use case: run the Cloud Vision API\n",
+ "## Run the Cloud Vision API\n",
"\n",
"You can use the Cloud Vision API to retrieve labels that describe an
image.\n",
"For example, the following image shows a lion with possible labels."
@@ -75,20 +77,20 @@
},
"source": [
"We want to run the Google Cloud Vision API on a large set of images,
and Apache Beam is the ideal tool to handle this workflow.\n",
- "This example notebook demonstates how to retrieve image labels with
this API on a small set of images.\n",
+ "This example demonstrates how to retrieve image labels with this API
on a small set of images.\n",
"\n",
- "The notebook follows these steps to implement this workflow:\n",
+ "The example follows these steps to implement this workflow:\n",
"* Read the images.\n",
"* Batch the images together to optimize the model call.\n",
"* Send the images to an external API to run inference.\n",
- "* Post-process the results of your API.\n",
+ "* Postprocess the results of your API.\n",
"\n",
"**Caution:** Be aware of API quotas and the heavy load you might
incur on your external API. Verify that your pipeline and API are configured
correctly for your use case.\n",
"\n",
"To optimize the calls to the external API, limit the parallel calls
to the external remote API by configuring
[PipelineOptions](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options).\n",
"In Apache Beam, different runners provide options to handle the
parallelism, for example:\n",
- "* With the [Direct
Runner](https://beam.apache.org/documentation/runners/direct/), use
`direct_num_workers`.\n",
- "* With the [Google Cloud Dataflow
Runner](https://beam.apache.org/documentation/runners/dataflow/), use
`max_num_workers`.\n",
+ "* With the [Direct
Runner](https://beam.apache.org/documentation/runners/direct/), use the
`direct_num_workers` pipeline option.\n",
+ "* With the [Google Cloud Dataflow
Runner](https://beam.apache.org/documentation/runners/dataflow/), use the
`max_num_workers` pipeline option.\n",
"\n",
"For information about other runners, see the [Beam capability
matrix](https://beam.apache.org/documentation/runners/capability-matrix/)."
]
@@ -99,7 +101,7 @@
"id": "FAawWOaiIYaS"
},
"source": [
- "## Installation\n",
+ "## Before you begin\n",
"\n",
"This section provides installation steps."
]
@@ -170,9 +172,11 @@
"id": "mL4MaHm_XOVd"
},
"source": [
- "## Remote inference on Cloud Vision API\n",
+ "## Run remote inference on Cloud Vision API\n",
+ "\n",
+ "This section demonstrates the steps to run remote inference on the
Cloud Vision API.\n",
"\n",
- "This section demonstates the steps to run remote inference on the
Cloud Vision API."
+ "Download and install Apache Beam and the required modules."
]
},
{
@@ -199,7 +203,7 @@
"id": "09k08IYlLmON"
},
"source": [
- "For this example, we use images from the [MSCoco
dataset](https://cocodataset.org/#explore) as a list of image urls.\n",
+ "This example uses images from the [MSCoco
dataset](https://cocodataset.org/#explore) as a list of image URLs.\n",
"This data is used as the pipeline input."
]
},
@@ -234,20 +238,20 @@
"id": "HLy7VKJhLrmT"
},
"source": [
- "### Custom DoFn\n",
+ "### Create a custom DoFn\n",
"\n",
"In order to implement remote inference, create a DoFn class. This
class sends a batch of images to the Cloud Vision API.\n",
"\n",
"The custom DoFn makes it possible to initialize the API. In case of a
custom model, a model can also be loaded in the `setup` function. \n",
"\n",
- "The `process` function is the most interesting part. In this function
we implement the model call and return its results.\n",
+ "The `process` function is the most interesting part. In this
function, we implement the model call and return its results.\n",
"\n",
- "**Caution:** When running remote inference, prepare to encounter,
identify, and handle failure as gracefully as possible. We recommend using the
following techniques: \n",
+ "When running remote inference, prepare to encounter, identify, and
handle failure as gracefully as possible. We recommend using the following
techniques: \n",
"\n",
"* **Exponential backoff:** Retry failed remote calls with
exponentially growing pauses between retries. Using exponential backoff ensures
that failures don't lead to an overwhelming number of retries in quick
succession. \n",
"\n",
- "* **Dead letter queues:** Route failed inferences to a separate
`PCollection` without failing the whole transform. You can continue execution
without failing the job (batch jobs' default behavior) or retrying indefinitely
(streaming jobs' default behavior).\n",
- "You can then run custom pipeline logic on the deadletter queue to log
the failure, alert, and push the failed message to temporary storage so that it
can eventually be reprocessed. "
+ "* **Dead-letter queues:** Route failed inferences to a separate
`PCollection` without failing the whole transform. You can continue execution
without failing the job (batch jobs' default behavior) or retrying indefinitely
(streaming jobs' default behavior).\n",
+ "You can then run custom pipeline logic on the dead-letter queue
(unprocessed messages queue) to log the failure, alert, and push the failed
message to temporary storage so that it can eventually be reprocessed."
]
},
{
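
The two failure-handling techniques named in the hunk above (exponential backoff and a dead-letter output) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not code from the commit or from the notebook: `call_with_backoff`, `process_all`, `flaky_call`, and the delay values are hypothetical, and a real pipeline would route failures to a separate `PCollection` instead of a Python list.

```python
import time
from typing import Callable, Dict, List, Tuple


def call_with_backoff(
    fn: Callable[[str], str],
    item: str,
    max_attempts: int = 4,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn(item), retrying with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return fn(item)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of retries: the caller dead-letters the item.
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    raise ValueError("max_attempts must be at least 1")


def process_all(
    fn: Callable[[str], str], items: List[str], **kwargs
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (results, dead_letters) without failing on a bad item."""
    results, dead_letters = [], []
    for item in items:
        try:
            results.append((item, call_with_backoff(fn, item, **kwargs)))
        except Exception as err:
            # Keep the failed input and the error so it can be reprocessed.
            dead_letters.append((item, repr(err)))
    return results, dead_letters


# Hypothetical remote call: fails twice per item, and always for "bad".
_attempts: Dict[str, int] = {}


def flaky_call(item: str) -> str:
    _attempts[item] = _attempts.get(item, 0) + 1
    if item == "bad" or _attempts[item] < 3:
        raise ConnectionError(f"transient failure for {item}")
    return item.upper()


# Record the pauses instead of sleeping so the example runs instantly.
delays: List[float] = []
results, dead_letters = process_all(flaky_call, ["a", "bad"], sleep=delays.append)
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the pauses be inspected without waiting for them.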
@@ -277,7 +281,7 @@
" image_requests = [vision.AnnotateImageRequest(image=image,
features=[feature]) for image in images]\n",
" batch_image_request =
vision.BatchAnnotateImagesRequest(requests=image_requests)\n",
"\n",
- " # Send batch request to the remote endpoint.\n",
+ " # Send the batch request to the remote endpoint.\n",
" responses =
self._client.batch_annotate_images(request=batch_image_request).responses\n",
" \n",
" return list(zip(image_urls, responses))\n"
@@ -289,7 +293,7 @@
"id": "lHJuyHhvL0-a"
},
"source": [
- "### Batching\n",
+ "### Manage batching\n",
"\n",
"Before we can chain together the pipeline steps, we need to
understand batching.\n",
"When running inference with your model, either in Apache Beam or in
an external API, you can batch your input to increase the efficiency of the
model execution.\n",
@@ -297,7 +301,7 @@
"\n",
"To manage the batching in this pipeline, include a `BatchElements`
transform to group elements together and form a batch of the desired size.\n",
"\n",
- "* If you have a streaming pipeline, consider using
[GroupIntoBatches](https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/)\n",
+ "* If you have a streaming pipeline, consider using
[GroupIntoBatches](https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/),\n",
"because `BatchElements` doesn't batch items across bundles.
`GroupIntoBatches` requires choosing a key within which items are batched.\n",
"\n",
"* When batching, make sure that the input batch matches the maximum
payload of the external API. \n",
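
As a rough illustration of the batching described in the hunk above, the following plain-Python sketch groups inputs into batches capped at a fixed size. `batch_elements` is a hypothetical helper, not Beam's `BatchElements` transform (which chooses batch sizes dynamically between `min_batch_size` and `max_batch_size`), and the image names are made up.

```python
from typing import Iterable, Iterator, List


def batch_elements(items: Iterable[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most max_batch_size items, in input order."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # Emit the final, possibly smaller, batch.
        yield batch


# Hypothetical inputs; the cap would come from the external API's payload limit.
urls = [f"image_{i}.jpg" for i in range(5)]
batches = list(batch_elements(urls, max_batch_size=2))
```

Each batch would then become one request to the remote endpoint, so the cap should not exceed what the API accepts in a single call.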
@@ -619,7 +623,7 @@
"id": "7gwn5bF1XaDm"
},
"source": [
- "### Metrics\n",
+ "## Monitor the pipeline\n",
"\n",
"Because monitoring can provide insight into the status and health of
the application, consider monitoring and measuring pipeline performance.\n",
"For information about the available tracking metrics, see
[RunInference
Metrics](https://beam.apache.org/documentation/ml/runinference-metrics/)."
diff --git a/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
b/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
index 645d62d32be..e45f1bd2d39 100644
--- a/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
+++ b/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
@@ -38,29 +38,23 @@
"\n",
"For rapid execution, Pandas loads all of the data into memory on a
single machine (one node). This configuration works well when dealing with
small-scale datasets. However, many projects involve datasets that are too big
to fit in memory. These use cases generally require parallel data processing
frameworks, such as Apache Beam.\n",
"\n",
- "\n",
- "## Apache Beam DataFrames\n",
- "\n",
- "\n",
- "Beam DataFrames provide a pandas-like\n",
+ "Beam DataFrames provide a Pandas-like\n",
"API to declare and define Beam processing pipelines. It provides a
familiar interface for machine learning practitioners to build complex
data-processing pipelines by only invoking standard pandas commands.\n",
"\n",
"To learn more about Apache Beam DataFrames, see the\n",
"[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
"\n",
- "## Goal\n",
- "The goal of this notebook is to explore a dataset preprocessed with
the Beam DataFrame API for machine learning model training.\n",
+ "## Overview\n",
+ "The goal of this example is to explore a dataset preprocessed with
the Beam DataFrame API for machine learning model training.\n",
"\n",
- "\n",
- "## Tutorial outline\n",
- "\n",
- "This notebook demonstrates the use of the Apache Beam DataFrames API
to perform common data exploration as well as the preprocessing steps that are
necessary to prepare your dataset for machine learning model training and
inference. These steps include the following: \n",
+ "This example demonstrates the use of the Apache Beam DataFrames API
to perform common data exploration as well as the preprocessing steps that are
necessary to prepare your dataset for machine learning model training and
inference. This example includes the following steps: \n",
"\n",
"* Removing unwanted columns.\n",
"* One-hot encoding categorical columns.\n",
"* Normalizing numerical columns.\n",
"\n",
- "\n"
+ "In this example, the first section demonstrates how to build and
execute a pipeline locally using the interactive runner.\n",
+ "The second section uses a distributed runner to demonstrate how to
run the pipeline on the full dataset.\n"
],
"metadata": {
"id": "iFZC1inKuUCy"
@@ -69,9 +63,9 @@
{
"cell_type": "markdown",
"source": [
- "## Installation\n",
+ "## Install Apache Beam\n",
"\n",
- "To explore the elements within a `PCollection`, install Apache Beam
with the `interactive` component to use the Interactive runner. The latest
implemented DataFrames API methods invoked in this notebook are available in
Apache Beam SDK versions 2.43 and later.\n"
+ "To explore the elements within a `PCollection`, install Apache Beam
with the `interactive` component to use the Interactive runner. The DataFrames
API methods invoked in this example are available in Apache Beam SDK versions
2.43 and later.\n"
],
"metadata": {
"id": "A0f2HJ22D4lt"
@@ -105,8 +99,8 @@
{
"cell_type": "markdown",
"source": [
- "## Part I : Local exploration with the Interactive Beam runner\n",
- "Start by using the [Interactive
Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html)
to explore and develop your pipeline.\n",
+ "## Local exploration with the Interactive Beam runner\n",
+ "Use the [Interactive
Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html)
runner to explore and develop your pipeline.\n",
"This runner allows you to test the code interactively, progressively
building out the pipeline before deploying it on a distributed runner. \n",
"\n",
"\n",
@@ -124,12 +118,12 @@
"source": [
"### Load the data\n",
"\n",
- "To read CSV files into Dataframes, Pandas has the\n",
+ "To read CSV files into DataFrames, Pandas has the\n",
"[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
"function.\n",
"This notebook uses the Beam\n",
"[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
- "function, which emulates `pandas.read_csv`. The main difference is
that the Beam function returns a deferred Beam DataFrame whereas the Pandas
function returns a standard DataFrame.\n"
+ "function, which emulates `pandas.read_csv`. The main difference is
that the Beam function returns a deferred Beam DataFrame, whereas the Pandas
function returns a standard DataFrame.\n"
]
},
{
@@ -170,8 +164,8 @@
"### Preprocess the data\n",
"\n",
"This example uses the [NASA - Nearest Earth Objects
dataset](https://cneos.jpl.nasa.gov/ca/).\n",
- "This dataset includes information about objects in the outer space.
Some objects are close enough to Earth to cause harm.\n",
- "Therefore, this dataset compiles the list of NASA certified asteroids
that are classified as the nearest earth objects to understand which objects
pose a risk."
+ "This dataset includes information about objects in outer space. Some
objects are close enough to Earth to cause harm.\n",
+ "This dataset compiles the list of NASA certified asteroids that are
classified as the nearest earth objects to understand which objects pose a
risk."
]
},
{
@@ -673,7 +667,7 @@
{
"cell_type": "markdown",
"source": [
- "Use the standard pandas command `DataFrame.describe()` to generate
descriptive statistics for the numerical columns like percentile, mean, std,
and so on. "
+ "Use the standard pandas command `DataFrame.describe()` to generate
descriptive statistics for the numerical columns, such as percentile, mean,
std, and so on. "
],
"metadata": {
"id": "MGAErO0lAYws"
@@ -1006,16 +1000,16 @@
"source": [
"Before running any transformations, verify that all of the columns
need to be used for model training. Start by looking at the column description
provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
"\n",
- "* **spk_id:** Object primary SPK-ID\n",
- "* **full_name:** Asteroid name\n",
- "* **near_earth_object:** Near-earth object flag\n",
+ "* **spk_id:** Object primary SPK-ID.\n",
+ "* **full_name:** Asteroid name.\n",
+ "* **near_earth_object:** Near-earth object flag.\n",
"* **absolute_magnitude:** The apparent magnitude an object would have
if it were located at a distance of 10 parsecs.\n",
"* **diameter:** Object diameter (from equivalent sphere) km unit.\n",
- "* **albedo:** A measure of the diffuse reflection of solar radiation
out of the total solar radiation and measured on a scale from 0 to 1.\n",
+ "* **albedo:** A measure of the diffuse reflection of solar radiation
out of the total solar radiation, measured on a scale from 0 to 1.\n",
"* **diameter_sigma:** 1-sigma uncertainty in object diameter km
unit.\n",
- "* **eccentricity:** A value between 0 and 1 that refers to how flat
or round the asteroid is \n",
- "* **inclination:** The angle with respect to the x-y ecliptic
plane\n",
- "* **moid_ld:** Earth Minimum Orbit Intersection Distance au unit\n",
+ "* **eccentricity:** A value between 0 and 1 that refers to how flat
or round the asteroid is.\n",
+ "* **inclination:** The angle with respect to the x-y ecliptic
plane.\n",
+ "* **moid_ld:** Earth Minimum Orbit Intersection Distance au unit.\n",
"* **object_class:** The classification of the asteroid. For a more
detailed description, see [NASA object
classifications](https://pdssbn.astro.umd.edu/data_other/objclass.shtml).\n",
"* **Semi-major axis au Unit:** The length of half of the long axis in
AU unit.\n",
"* **hazardous_flag:** Identifies hazardous asteroids."
@@ -1027,7 +1021,7 @@
"id": "DzYVKbwTp72d"
},
"source": [
- "The **'spk_id'** and **'full_name'** columns are unique for each row.
You can remove these columns, because they are not needed for model training."
+ "The **spk_id** and **full_name** columns are unique for each row. You
can remove these columns, because they are not needed for model training."
]
},
{
@@ -1153,7 +1147,7 @@
"id": "00MRdFGLwQiD"
},
"source": [
- "Most of the columns do not have missing values. However, the columns
**'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values.
Because these values cannot be measured or derived and aren't needed for
training the ML model, remove the columns."
+ "Most of the columns do not have missing values. However, the columns
**diameter**, **albedo**, and **diameter_sigma** have many missing values.
Because these values cannot be measured or derived and aren't needed for
training the ML model, remove the columns."
]
},
{
@@ -1511,7 +1505,7 @@
"id": "a3PojL3WBqgE"
},
"source": [
- "Next, normalize the numerical columns so that they can be used to
train a model. To standarize the data, you can subtract the mean and divide by
the standard deviation. This process is also known as finding the
[z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score).\n",
+ "Normalize the numerical columns so that they can be used to train a
model. To standardize the data, you can subtract the mean and divide by the
standard deviation. This process is also known as finding the
[z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score).\n",
"This step improves the performance and training stability of the
model during training and inference.\n"
]
},
@@ -1859,7 +1853,7 @@
"id": "qdNILsajFvex"
},
"source": [
- "Convert the categorical columns into one-hot encoded variables to use
them during training.\n"
+ "Next, convert the categorical columns into one-hot encoded variables
to use during training.\n"
]
},
{
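
The two preprocessing steps touched by the hunks above, z-score normalization and one-hot encoding, can be sketched with the Python standard library alone. This is an assumption-laden illustration, not the notebook's code: the notebook works on deferred Beam DataFrames, whereas `z_score` and `one_hot` here are hypothetical helpers over plain lists, and the sample values are invented. The sketch uses the sample standard deviation, which is also the pandas default for `std()`.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_score(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(labels: List[str]) -> Dict[str, List[int]]:
    """Build one 0/1 indicator column per distinct category."""
    return {
        category: [int(label == category) for label in labels]
        for category in sorted(set(labels))
    }


# Hypothetical values standing in for a numerical and a categorical column.
scaled = z_score([1.0, 2.0, 3.0])
encoded = one_hot(["APO", "AMO", "APO"])
```

After scaling, the column has mean 0 and unit standard deviation; after encoding, each row has exactly one indicator set to 1.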
@@ -2596,7 +2590,7 @@
"\n",
"This section combines the previous steps into a full pipeline
implementation, and then visualizes the preprocessed data.\n",
"\n",
- "Note that the only standard Apache Beam method invoked here is the
`pipeline` instance. The rest of the preprocessing commands are based on native
Pandas methods that are integrated with the Apache Beam DataFrame API."
+ "Note that the only standard Apache Beam method invoked here is the
`pipeline` instance. The rest of the preprocessing commands are based on native
pandas methods that are integrated with the Apache Beam DataFrame API."
]
},
{
@@ -3339,7 +3333,7 @@
"id": "xZvJTqa3XKI_"
},
"source": [
- "## Part II : Process the full dataset with the distributed runner\n",
+ "## Process the full dataset with the distributed runner\n",
"The previous section demonstrates how to build and execute the
pipeline locally using the interactive runner.\n",
"This section demonstrates how to run the pipeline on the full dataset
by switching to a distributed runner. For this example, the pipeline runs on
[Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
]
@@ -3361,7 +3355,7 @@
{
"cell_type": "markdown",
"source": [
- "These steps process the full dataset, `full.csv`, which contains
approximately one million rows. To materialize the deferred dataframe, these
steps also write the results to a CSV file instead of using `ib.collect()`.\n",
+ "These steps process the full dataset, `full.csv`, which contains
approximately one million rows. To materialize the deferred DataFrame, these
steps also write the results to a CSV file instead of using `ib.collect()`.\n",
"\n",
"To switch from an interactive runner to a distributed runner, update
the pipeline options. The rest of the pipeline steps don't change."
],
@@ -3450,12 +3444,10 @@
"\n",
"This tutorial demonstrated how to analyze and preprocess a
large-scale dataset with the Apache Beam DataFrames API. You can now train a
model on a classification task using the preprocessed dataset.\n",
"\n",
- "To learn more about how to get started with classifying structured
data, see:\n",
- "\n",
- "* [Structred data classification from
scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/)\n",
+ "To learn more about how to get started with classifying structured
data, see \n",
+ "[Structured data classification from
scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/).\n",
"\n",
- "To continue learning, find another dataset to use with the Apache
Beam DataFrames API processing. Think carefully about which features to include
in your model and how to represent them.\n",
- "\n"
+ "To continue learning, find another dataset to use with the Apache
Beam DataFrames API processing. Think carefully about which features to include
in your model and how to represent them.\n"
],
"metadata": {
"id": "UOLr6YgOOSVQ"
@@ -3466,11 +3458,11 @@
"source": [
"## Resources\n",
"\n",
- "* [Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview) -- An
overview of the Apache Beam DataFrames API.\n",
- "* [Differences from
pandas](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas)
-- Reviews the differences between Apache Beam DataFrames and Pandas
DataFrames, as well as some of the workarounds for unsupported operations.\n",
- "* [10 minutes to
Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) --
A quickstart guide to the Pandas DataFrames.\n",
- "* [Pandas DataFrame
API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) -- The
API reference for the Pandas DataFrames.\n",
- "* [Data preparation and feature training in
ML](https://developers.google.com/machine-learning/data-prep) -- A guideline
about data transformation for ML training."
+ "* [Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview) - An
overview of the Apache Beam DataFrames API.\n",
+ "* [Differences from
pandas](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas)
- Reviews the differences between Apache Beam DataFrames and Pandas
DataFrames, as well as some of the workarounds for unsupported operations.\n",
+ "* [10 minutes to
Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) - A
quickstart guide to the Pandas DataFrames.\n",
+ "* [Pandas DataFrame
API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) - The
API reference for the Pandas DataFrames.\n",
+ "* [Data preparation and feature training in
ML](https://developers.google.com/machine-learning/data-prep) - A guideline
about data transformation for ML training."
],
"metadata": {
"id": "nG9WXXVcMCe_"
diff --git a/examples/notebooks/beam-ml/run_custom_inference.ipynb
b/examples/notebooks/beam-ml/run_custom_inference.ipynb
index 9d57bf9f475..c45405204d2 100644
--- a/examples/notebooks/beam-ml/run_custom_inference.ipynb
+++ b/examples/notebooks/beam-ml/run_custom_inference.ipynb
@@ -5,6 +5,7 @@
"execution_count": 1,
"id": "C1rAsD2L-hSO",
"metadata": {
+ "cellView": "form",
"id": "C1rAsD2L-hSO"
},
"outputs": [],
@@ -41,9 +42,10 @@
"This notebook demonstrates how to run inference on your custom
framework using the\n",
"[ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler)
class.\n",
"\n",
- "Named-Entity Recognition (NER) is one of the most common tasks for
natural language processing (NLP). \n",
- "NLP locates and named entities in unstructured text and classifies
the entities using pre-defined labels, such as person name, organization, date,
and so on.\n",
- "This example illustrates how to use the popular `spaCy` package to
load an ML model and perform inference in an Apache Beam pipeline using the
RunInference `PTransform`.\n",
+ "Named-entity recognition (NER) is one of the most common tasks for
natural language processing (NLP). \n",
+ "NLP locates named entities in unstructured text and classifies the
entities using pre-defined labels, such as person name, organization, date, and
so on.\n",
+ "\n",
+ "This example illustrates how to use the popular `spaCy` package to
load a machine learning (ML) model and perform inference in an Apache Beam
pipeline using the RunInference `PTransform`.\n",
"For more information about the RunInference API, see [Machine
Learning](https://beam.apache.org/documentation/sdks/python-machine-learning)
in the Apache Beam documentation."
]
},
@@ -58,7 +60,7 @@
"\n",
"The RunInference library is available in Apache Beam versions 2.40
and later.\n",
"\n",
- "For this example, you need to install `spaCy` and `pandas`. A small
NER model (`en_core_web_sm`) is also installed, but you can use any valid
`spaCy` model."
+ "For this example, you need to install `spaCy` and `pandas`. A small
NER model, `en_core_web_sm`, is also installed, but you can use any valid
`spaCy` model."
]
},
{
@@ -84,7 +86,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Learn more about `spaCy`\n",
+ "## Learn about `spaCy`\n",
"\n",
"To learn more about `spaCy`, create a `spaCy` language object in
memory using `spaCy`'s trained models.\n",
"You can install these models as Python packages.\n",
@@ -242,9 +244,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Create a`ModelHandler` to use `spaCy` for inference\n",
+ "## Create a model handler\n",
"\n",
- "This section demonstrates how to create your own `ModelHandler`."
+ "This section demonstrates how to create your own `ModelHandler` so
that you can use `spaCy` for inference."
]
},
{
@@ -420,7 +422,6 @@
" | \"CreateSentences\" >> beam.Create(text_strings_with_keys)\n",
" | \"RunInferenceSpacy\" >>
RunInference(keyed_spacy_model_handler)\n",
" # Generate a schema suitable for conversion to a dataframe using
Map to Row objects.\n",
- " # to a dataframe.\n",
" | 'ToRows' >> beam.Map(lambda row: beam.Row(key=row[0],
text=row[1][0], predictions=row[1][1]))\n",
" )"
]
diff --git a/examples/notebooks/beam-ml/run_inference_multi_model.ipynb
b/examples/notebooks/beam-ml/run_inference_multi_model.ipynb
index a1e52b23546..cabe60e7a3a 100644
--- a/examples/notebooks/beam-ml/run_inference_multi_model.ipynb
+++ b/examples/notebooks/beam-ml/run_inference_multi_model.ipynb
@@ -71,7 +71,7 @@
{
"cell_type": "markdown",
"source": [
- "## Use case: Image captioning with cascade models "
+ "## Image captioning with cascade models"
],
"metadata": {
"id": "i1uyzlj3s3e_"
@@ -80,12 +80,12 @@
{
"cell_type": "markdown",
"source": [
- "Image captioning has various applications, such as image indexing for
information retreival, virtual assistant training, and various natural language
processing applications.\n",
+ "Image captioning has various applications, such as image indexing for
information retrieval, virtual assistant training, and natural language
processing.\n",
"\n",
"This example shows how to generate captions on a large set of
images. Apache Beam is the ideal tool to handle this workflow. We use two
models for this task:\n",
"\n",
- "* [BLIP](https://github.com/salesforce/BLIP): Used to generate a set
of candidate captions for a given image. \n",
- "* [CLIP](https://github.com/openai/CLIP): Used to rank the generated
captions based on accuracy."
+ "* [BLIP](https://github.com/salesforce/BLIP): Generates a set of
candidate captions for a given image. \n",
+ "* [CLIP](https://github.com/openai/CLIP): Ranks the generated
captions based on accuracy."
],
"metadata": {
"id": "cP1sBhNacS8b"
@@ -106,14 +106,14 @@
"The steps to build this pipeline are as follows:\n",
"* Read the images.\n",
"* Preprocess the images for caption generation for inference with the
BLIP model.\n",
- "* Inference with BLIP to generate a list of caption candidates.\n",
+ "* Run inference with BLIP to generate a list of caption
candidates.\n",
"* Aggregate the generated captions with their source image.\n",
- "* Preprocess the aggregated image-caption pair to rank them with
CLIP.\n",
- "* Inference with CLIP to generate the caption ranking. \n",
+ "* Preprocess the aggregated image-caption pairs to rank them with
CLIP.\n",
+ "* Run inference with CLIP to generate the caption ranking. \n",
"* Print the image names and the captions sorted according to their
ranking.\n",
"\n",
"\n",
- "The following diagram illustrates the steps in the inference
pipelines used in this notebook:"
+ "The following diagram illustrates the steps in the inference
pipelines used in this notebook."
],
"metadata": {
"id": "lBPfy-bYgLuD"
@@ -284,7 +284,7 @@
{
"cell_type": "markdown",
"source": [
- "### CLIP\n",
+ "### Install CLIP dependencies\n",
"\n",
"Download and install the CLIP dependencies."
],
@@ -343,7 +343,7 @@
{
"cell_type": "markdown",
"source": [
- "### BLIP\n",
+ "### Install BLIP dependencies\n",
"\n",
"Download and install the BLIP dependencies."
],
@@ -417,7 +417,7 @@
{
"cell_type": "markdown",
"source": [
- "### I/O helper functions\n",
+ "### Install I/O helper functions\n",
"\n",
"Download and install the dependencies for the I/O helper functions."
],
@@ -430,7 +430,7 @@
"source": [
"class ReadImagesFromUrl(beam.DoFn):\n",
" \"\"\"\n",
- " Read an image from a given url and return a tuple of the
images_url\n",
+ " Read an image from a given URL and return a tuple of the
images_url\n",
" and image data.\n",
" \"\"\"\n",
" def process(self, element: str) -> Tuple[str, Image.Image]:\n",
@@ -441,7 +441,7 @@
"\n",
"class FormatCaptions(beam.DoFn):\n",
" \"\"\"\n",
- " Print the image name and it's most relevant captions after CLIP
ranking.\n",
+ " Print the image name and its most relevant captions after CLIP
ranking.\n",
" \"\"\"\n",
" def __init__(self, number_of_top_captions: int):\n",
" self._number_of_top_captions = number_of_top_captions\n",
@@ -474,10 +474,10 @@
{
"cell_type": "markdown",
"source": [
- "Define the preprocessing and postprocessing function for each of the
models.\n",
+ "Define the preprocessing and postprocessing functions for each of the
models.\n",
"\n",
"To prepare the instance for processing bundles of elements by
initializing it, and to cache the resources that the processing transform needs, use
`DoFn.setup()`.\n",
- "This step avoids unnecessary re-initializations on every invocation
to the processing method."
+ "This step avoids unnecessary re-initializations on every invocation
of the processing method."
],
"metadata": {
"id": "wEViP715fes4"
@@ -486,8 +486,8 @@
{
"cell_type": "markdown",
"source": [
- "### BLIP\n",
- "Define the preprocessing and postprocessing function for BLIP."
+ "### Define BLIP functions\n",
+ "Define the preprocessing and postprocessing functions for BLIP."
],
"metadata": {
"id": "X1UGv6bbyNxY"
@@ -499,7 +499,7 @@
"class PreprocessBLIPInput(beam.DoFn):\n",
"\n",
" \"\"\"\n",
- " Process the raw image input to a format suitable for BLIP
Inference. The processed\n",
+ " Process the raw image input to a format suitable for BLIP
inference. The processed\n",
" images are duplicated to the number of desired captions per image.
\n",
"\n",
" Preprocessing transformation taken from: \n",
@@ -520,7 +520,7 @@
"\n",
" def process(self, element):\n",
" image_url, image = element \n",
- " # Update this step when this ticket is resolved:
https://github.com/apache/beam/issues/21863\n",
+ " # The following lines provide a workaround to turn off
BatchElements.\n",
" preprocessed_img = self._transform(image).unsqueeze(0)\n",
" preprocessed_img =
preprocessed_img.repeat(self._captions_per_image, 1, 1, 1)\n",
" # Parse the processed input to a dictionary to a format suitable
for RunInference.\n",
@@ -546,9 +546,9 @@
{
"cell_type": "markdown",
"source": [
- "### CLIP \n",
+ "### Define CLIP functions \n",
"\n",
- "Define the preprocessing and postprocessing function for CLIP."
+ "Define the preprocessing and postprocessing functions for CLIP."
],
"metadata": {
"id": "EZHfa1KzWWDI"
@@ -642,8 +642,12 @@
{
"cell_type": "markdown",
"source": [
- "Note that we use a `KeyedModelHandler` for both models to attach a
key to the general `ModelHandler`.\n",
- "The key is used to keep a reference to the image that the inference
is associated with and is used in the postprocessing steps.\n",
+ "Use a `KeyedModelHandler` for both models to attach a key to the
general `ModelHandler`.\n",
+ "The key is used for the following purposes:\n",
+ "* To keep a reference to the image that the inference is associated
with.\n",
+ "* To aggregate transforms of different inputs.\n",
+ "* To run postprocessing steps correctly.\n",
+ "\n",
"In this example, we use the `image_url` as the key."
],
"metadata": {
@@ -655,13 +659,13 @@
"source": [
"class
PytorchNoBatchModelHandlerKeyedTensor(PytorchModelHandlerKeyedTensor):\n",
" \"\"\"Wrapper to PytorchModelHandler to limit batch size to
1.\n",
- " The caption strings generated from BLIP tokenizer may have
different\n",
- " lengths, which doesn't work with torch.stack() in current
RunInference\n",
- " implementation since stack() requires tensors to be the same
size.\n",
+ " The caption strings generated from the BLIP tokenizer might have
different\n",
+ " lengths. Different length strings don't work with torch.stack()
in the current RunInference\n",
+ " implementation, because stack() requires tensors to be the same
size.\n",
" Restricting max_batch_size to 1 means there is only 1 example per
`batch`\n",
" in the run_inference() call.\n",
" \"\"\"\n",
- " # Update this step when this ticket is resolved:
https://github.com/apache/beam/issues/21863\n",
+ " # The following lines provide a workaround to turn off
BatchElements.\n",
" def batch_elements_kwargs(self):\n",
" return {'max_batch_size': 1}"
],
@@ -683,7 +687,7 @@
{
"cell_type": "markdown",
"source": [
- "## BLIP\n",
+ "## Generate captions with BLIP\n",
"\n",
"Use BLIP to generate a set of candidate captions for a given image."
],
@@ -711,7 +715,7 @@
"source": [
"class BLIPWrapper(torch.nn.Module):\n",
" \"\"\"\n",
- " Wrapper around the BLIP model to overwrite the default \"forward\"
method with the \"generate\" since BLIP uses the \n",
+ " Wrapper around the BLIP model to overwrite the default \"forward\"
method with the \"generate\" method, because BLIP uses the \n",
" \"generate\" method to produce the image captions.\n",
" \"\"\"\n",
" \n",
@@ -725,7 +729,7 @@
"\n",
" def forward(self, inputs: torch.Tensor):\n",
" # Squeeze because RunInference adds an extra dimension, which is
empty.\n",
- " # Update this step when this ticket is resolved:
https://github.com/apache/beam/issues/21863\n",
+ " # The following lines provide a workaround to turn off
BatchElements.\n",
" inputs = inputs.squeeze(0)\n",
" captions = self._model.generate(inputs,\n",
" sample=True,\n",
@@ -756,7 +760,7 @@
{
"cell_type": "markdown",
"source": [
- "## CLIP\n",
+ "## Rank captions with CLIP\n",
"\n",
"Use CLIP to rank the generated captions based on the accuracy with
which they represent the image."
],
@@ -771,7 +775,7 @@
"\n",
" def forward(self, **kwargs: Dict[str, torch.Tensor]):\n",
" # Squeeze because RunInference adds an extra dimension, which is
empty.\n",
- " # Update this step when this ticket is resolved:
https://github.com/apache/beam/issues/21863.\n",
+ " # The following lines provide a workaround to turn off
BatchElements.\n",
" kwargs = {key: tensor.squeeze(0) for key, tensor in
kwargs.items()}\n",
" output = super().forward(**kwargs)\n",
" logits = output.logits_per_image\n",
@@ -888,7 +892,7 @@
{
"cell_type": "markdown",
"source": [
- "## Initialize pipeline run parameters\n",
+ "## Initialize the pipeline run parameters\n",
"\n",
"Specify the number of captions generated per image and the number of
captions to display with each image."
],
@@ -914,7 +918,7 @@
{
"cell_type": "markdown",
"source": [
- "## Run pipeline"
+ "## Run the pipeline"
],
"metadata": {
"id": "5T9Pcdp7oNb8"
@@ -923,7 +927,7 @@
{
"cell_type": "markdown",
"source": [
- "This example uses raw images from the `read_images` pipeline as
inputs for both models, because each model needs to preprocess the raw images
differently. They require a different embedding representation for image
captioning and image-captions pair ranking.\n",
+ "This example uses raw images from the `read_images` pipeline as
inputs for both models. Each model needs to preprocess the raw images
differently, because they require a different embedding representation for
image captioning and for image-captions pair ranking.\n",
"\n",
"To aggregate the raw images with the generated caption by their key
(the image URL), this example uses `CoGroupByKey`. This process produces a
tuple of image-captions pairs that is then passed to the CLIP transform and
used for ranking."
],
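To illustrate the aggregation step described in the hunk above, here is a minimal plain-Python sketch of what grouping two keyed collections by image URL produces. It is a stand-in for the idea only: it is not the notebook's code and does not use the Beam API, and the helper name `co_group_by_key`, the tag names, and the sample URLs and captions are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def co_group_by_key(
    **tagged: Iterable[Tuple[str, Any]]
) -> Dict[str, Dict[str, List[Any]]]:
    """Group several keyed collections by key (hypothetical helper).

    Each keyword argument is a tag name mapped to (key, value) pairs.
    The result maps every key to a dict of tag -> list of values.
    """
    tags = list(tagged)
    grouped: Dict[str, Dict[str, List[Any]]] = defaultdict(
        lambda: {tag: [] for tag in tags})
    for tag, pairs in tagged.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)


# Hypothetical sample data: the image URL is the key in both collections.
images = [("img_a.jpg", "<raw image A>"), ("img_b.jpg", "<raw image B>")]
captions = [
    ("img_a.jpg", "a dog on a beach"),
    ("img_a.jpg", "a puppy playing in sand"),
    ("img_b.jpg", "a city street at night"),
]

# Each key now carries its raw image together with its candidate captions.
pairs = co_group_by_key(image=images, captions=captions)
```

Keeping the raw image and its captions under one key is what lets a later ranking step see both at once.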
diff --git a/examples/notebooks/beam-ml/run_inference_pytorch.ipynb
b/examples/notebooks/beam-ml/run_inference_pytorch.ipynb
index 3afc6bad989..d0a350982f4 100644
--- a/examples/notebooks/beam-ml/run_inference_pytorch.ipynb
+++ b/examples/notebooks/beam-ml/run_inference_pytorch.ipynb
@@ -54,7 +54,7 @@
"This notebook demonstrates the use of the RunInference transform for
PyTorch. Apache Beam includes implementations of the
[ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler)
class for [users of
PyTorch](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.pytorch_inference.html).
For more information about the RunInference API, see [Machine
Learning](https://beam.apache.or [...]
"\n",
"\n",
- "This notebook illustrates common RunInference patterns,such as:\n",
+ "This notebook illustrates common RunInference patterns, such as:\n",
"* Using a database with RunInference.\n",
"* Postprocessing results after using RunInference.\n",
"* Inference with multiple models in the same pipeline.\n",
@@ -71,7 +71,7 @@
"source": [
"## Dependencies\n",
"\n",
- "The RunInference library is available in Apache Beam versions
<b>2.40</b> and later.\n",
+ "The RunInference library is available in Apache Beam versions 2.40
and later.\n",
     "\n",
     "To use the PyTorch RunInference API, you need to install the PyTorch
module. To install PyTorch, use `pip`:"
]
@@ -235,7 +235,7 @@
    },
    "source": [
     "### Train the linear regression model on 5 times data\n",
- "Use the following to train your linear regression model on the 5
times table."
+ "Use the following code to train your linear regression model on the 5
times table."
]
},
{
@@ -270,7 +270,7 @@
"id": "bd106b29-6187-42c1-9743-1666c147b5e3"
},
"source": [
- "Save the model using `torch.save()` and then confirm that the saved
model file exists."
+ "Save the model using `torch.save()`, and then confirm that the saved
model file exists."
]
},
{
@@ -304,6 +304,7 @@
},
"source": [
"### Prepare train and test data for a 10 times model\n",
+ "This example model is a 10 times table.\n",
"* `x` contains values in the range from 0 to 99.\n",
"* `y` is a list of 10 * `x`. "
]
@@ -404,7 +405,7 @@
"source": [
"### Use RunInference within the pipeline\n",
"\n",
- "1. Create a PyTorch model handler object by passing required
arguments such as `state_dict_path`, `model_class`, `model_params` to the
`PytorchModelHandlerTensor` class.\n",
+ "1. Create a PyTorch model handler object by passing required
arguments such as `state_dict_path`, `model_class`, and `model_params` to the
`PytorchModelHandlerTensor` class.\n",
"2. Pass the `PytorchModelHandlerTensor` object to the RunInference
transform to perform predictions on unkeyed data."
]
},
@@ -455,8 +456,8 @@
"id": "9d95e69b-203f-4abb-9abb-360bdf4d769a"
},
"source": [
- "## Pattern 2: Post-process RunInference results.\n",
- "This pattern demonstrates how to post-process the RunInference
results.\n",
+ "## Pattern 2: Postprocess RunInference results\n",
+ "This pattern demonstrates how to postprocess the RunInference
results.\n",
"\n",
"Add a `PredictionProcessor` to the pipeline after `RunInference`.
`PredictionProcessor` processes the output of the `RunInference` transform."
]
@@ -529,11 +530,11 @@
"\n",
"Modify the pipeline to read from sources like CSV files and
BigQuery.\n",
"\n",
- "In this step we do the following:\n",
+ "In this step, you take the following actions:\n",
"\n",
"* To handle keyed data, wrap the `PytorchModelHandlerTensor` object
around `KeyedModelHandler`.\n",
"* Add a map transform that converts a table row into `Tuple[str,
float]`.\n",
- "* Add a map transform that converts `Tuple[str, float]` from to
`Tuple[str, torch.Tensor]`.\n",
+ "* Add a map transform that converts `Tuple[str, float]` to
`Tuple[str, torch.Tensor]`.\n",
"* Modify the post-inference processor to output results with the key."
]
},
@@ -564,7 +565,8 @@
"id": "f22da313-5bf8-4334-865b-bbfafc374e63"
},
"source": [
- "### Create a source with attached key\n"
+ "### Create a source with attached key\n",
+ "This section shows how to create either a BigQuery or a CSV source
with an attached key."
]
},
{
@@ -573,7 +575,8 @@
"id": "c9b0fb49-d605-4f26-931a-57f42b0ad253"
},
"source": [
- "#### Use BigQuery as the source"
+    "#### Use BigQuery as the source\n",
+ "Follow these steps to use BigQuery as your source."
]
},
{
@@ -741,7 +744,8 @@
"id": "53ee7f24-5625-475a-b8cc-9c031591f304"
},
"source": [
- "#### Use a CSV file as the source"
+    "#### Use a CSV file as the source\n",
+ "Follow these steps to use a CSV file as your source."
]
},
{
@@ -826,7 +830,7 @@
     "## Pattern 4: Inference with multiple models in the same pipeline\n",
     "This pattern demonstrates how to use inference with multiple models in
the same pipeline.\n",
     "\n",
-    "### Inference with multiple models in parallel\n",
+    "### Multiple models in parallel\n",
     "This section demonstrates how to use inference with multiple models in
parallel."
]
},
@@ -926,7 +930,7 @@
"id": "e71e6706-5d8d-4322-9def-ac7fb20d4a50"
},
"source": [
- "### Inference with multiple models in sequence\n",
+    "### Multiple models in sequence\n",
     "This section demonstrates how to use inference with multiple models in
sequence.\n",
"\n",
"In a sequential pattern, data is sent to one or more models in
sequence, \n",
diff --git
a/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb
b/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb
index 3dac52f9d7a..60f79d63a5b 100644
--- a/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb
+++ b/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb
@@ -17,11 +17,6 @@
"cells": [
{
"cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "LzOTNrs_P6Vv"
- },
- "outputs": [],
"source": [
"# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
"\n",
@@ -41,16 +36,13 @@
"# KIND, either express or implied. See the License for the\n",
"# specific language governing permissions and limitations\n",
"# under the License"
- ]
- },
- {
- "cell_type": "markdown",
+ ],
"metadata": {
+ "cellView": "form",
"id": "faayYQYrQzY3"
- },
- "source": [
- "## Use RunInference in Apache Beam"
- ]
+ },
+ "execution_count": null,
+ "outputs": []
},
{
"cell_type": "markdown",
@@ -58,8 +50,9 @@
"id": "JjAt1GesQ9sg"
},
"source": [
- "Starting with Apache Beam 2.40.0, you can use Apache Beam with the
RunInference API to use machine learning (ML) models for local and remote
inference with batch and streaming pipelines.\n",
- "The RunInference API leverages Apache Beam concepts, such as the
BatchElements transform and the Shared class, to support models in your
pipelines that create transforms optimized for machine learning inferences.\n",
+ "# Use RunInference in Apache Beam\n",
+ "You can use Apache Beam versions 2.40.0 and later with the
[RunInference
API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
for local and remote inference with batch and streaming pipelines.\n",
+ "The RunInference API leverages Apache Beam concepts, such as the
`BatchElements` transform and the `Shared` class, to support models in your
pipelines that create transforms optimized for machine learning inference.\n",
"\n",
"For more information about the RunInference API, see [Machine
Learning](https://beam.apache.org/documentation/sdks/python-machine-learning)
in the Apache Beam documentation."
]
@@ -70,13 +63,13 @@
"id": "A8xNRyZMW1yK"
},
"source": [
- "This notebook demonstrates how to use the RunInference API with three
popular ML frameworks: PyTorch, TensorFlow, and scikit-learn. The three
pipelines use a text classification model for generating predictions.\n",
+ "This example demonstrates how to use the RunInference API with three
popular ML frameworks: PyTorch, TensorFlow, and scikit-learn. The three
pipelines use a text classification model for generating predictions.\n",
"\n",
"Follow these steps to build a pipeline:\n",
"* Read the images.\n",
"* If needed, preprocess the text.\n",
- "* Inference with the PyTorch, TensorFlow, or Scikit-learn model.\n",
- "* If needed, postprocess the output from RunInference."
+    "* Run inference with the PyTorch, TensorFlow, or scikit-learn
model.\n",
+ "* If needed, postprocess the output."
]
},
{
@@ -126,9 +119,9 @@
"id": "ObRPUrlEbjHj"
},
"source": [
- "### Model\n",
+ "### Install the model\n",
"\n",
- "This example uses a pretrained text classification model,
[distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you).
This model is a checkpoint of DistilBERT-base-uncased, fine-tuned on the SST-2
dataset.\n"
+ "This example uses a pretrained text classification model,
[distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you).
This model is a checkpoint of `DistilBERT-base-uncased`, fine-tuned on the
SST-2 dataset.\n"
]
},
{
@@ -165,7 +158,7 @@
"id": "vA1UmbFRb5C-"
},
"source": [
- "### Helper functions\n",
+ "### Install helper functions\n",
"\n",
"The model also uses helper functions."
]
@@ -231,9 +224,9 @@
"id": "WYYbQTMWctkW"
},
"source": [
- "### RunInference pipeline\n",
+ "### Run the pipeline\n",
"\n",
- "This section demonstrates how to use create and run the RunInference
pipeline."
+ "This section demonstrates how to create and run the RunInference
pipeline."
]
},
{
@@ -797,7 +790,7 @@
"id": "h2JP7zsqerCT"
},
"source": [
- "### Model"
+ "### Install the model"
]
},
{
@@ -827,7 +820,7 @@
"id": "GZ-Ioc8ZfyIT"
},
"source": [
- "### Helper functions\n",
+ "### Install helper functions\n",
"\n",
"The model also uses helper functions."
]
@@ -874,7 +867,7 @@
"id": "PZVwI4BbgaAI"
},
"source": [
- "### Prepare the Input\n",
+ "### Prepare the input\n",
"\n",
"This section demonstrates how to prepare the input for your model."
]
@@ -921,9 +914,9 @@
"id": "BYkQl_l8gRgo"
},
"source": [
- "### RunInference Pipeline\n",
+ "### Run the pipeline\n",
"\n",
- "This section demonstrates how to use create and run the RunInference
pipeline."
+ "This section demonstrates how to create and run the RunInference
pipeline."
]
},
{
@@ -991,7 +984,7 @@
"id": "6ArL_55kjxkO"
},
"source": [
- "### Install Dependencies\n",
+ "### Install dependencies\n",
"\n",
"First, download and install the dependencies."
]
@@ -1030,7 +1023,7 @@
"id": "-7ABKlZvkFHy"
},
"source": [
- "### Model\n",
+ "### Install the model\n",
"\n",
"To classify movie reviews as either positive or negative, train and
save a sentiment analysis pipeline about movie reviews."
]
@@ -1059,9 +1052,9 @@
"id": "KL4Cx8s0mBqn"
},
"source": [
- "### RunInference Pipeline\n",
+ "### Run the pipeline\n",
"\n",
- "This section demonstrates how to use create and run the RunInference
pipeline."
+ "This section demonstrates how to create and run the RunInference
pipeline."
]
},
{
diff --git a/examples/notebooks/beam-ml/run_inference_sklearn.ipynb
b/examples/notebooks/beam-ml/run_inference_sklearn.ipynb
index 9afcccc30f6..c9e151750a3 100644
--- a/examples/notebooks/beam-ml/run_inference_sklearn.ipynb
+++ b/examples/notebooks/beam-ml/run_inference_sklearn.ipynb
@@ -51,21 +51,21 @@
},
"source": [
"# Apache Beam RunInference for scikit-learn\n",
- "This notebook demonstrates the use of the RunInference transform for
[scikit-learn](https://scikit-learn.org/) also called sklearn.\n",
+ "This notebook demonstrates the use of the RunInference transform for
[scikit-learn](https://scikit-learn.org/), also called sklearn.\n",
"Apache Beam
[RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
has implementations of the
[ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler)
class prebuilt for scikit-learn. For more information about the RunInference
API, see [Machine
Learning](https://beam.apache.org/documentation/sdks/python-machine [...]
"\n",
- "Users can choose a model handler for their input data type:\n",
- "* The [numpy model
handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n",
- "* The [pandas dataframes model
handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n",
+ "You can choose the appropriate model handler based on your input data
type:\n",
+ "* [NumPy model
handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n",
+    "* [Pandas DataFrame model
handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerPandas)\n",
"\n",
- "With RunInference, these ModelHandlers manage batching,
vectorization, and prediction optimization for your scikit-learn pipeline or
model.\n",
+ "With RunInference, these model handlers manage batching,
vectorization, and prediction optimization for your scikit-learn pipeline or
model.\n",
"\n",
"This notebook demonstrates the following common RunInference
patterns:\n",
"* Generate predictions.\n",
"* Postprocess results after RunInference.\n",
- "* Inference with multiple models in the same pipeline.\n",
+ "* Run inference with multiple models in the same pipeline.\n",
"\n",
- "The linear regression models used in these samples are trained on
data that correspondes to the 5 and 10 times table; that is,`y = 5x` and `y =
10x` respectively."
+    "The linear regression models used in these samples are trained on
data that corresponds to the 5 and 10 times tables; that is, `y = 5x` and `y =
10x`, respectively."
]
},
{
@@ -75,7 +75,7 @@
"Complete the following setup steps:\n",
"1. Install dependencies for Apache Beam.\n",
"1. Authenticate with Google Cloud.\n",
- "1. Specify your project and bucket. You need the project and bucket
to save and load models."
+ "1. Specify your project and bucket. You use the project and bucket to
save and load models."
],
"metadata": {
"id": "zzwnMzzgdyPB"
@@ -176,7 +176,7 @@
"2. Train the linear regression model.\n",
"3. Save the scikit-learn model using `pickle`.\n",
"\n",
- "In this example, we create two models, one with the 5 times model and
a section with the 10 times model."
+    "In this example, you create two models: one for the 5 times table
and a second for the 10 times table."
]
},
{
@@ -214,9 +214,9 @@
"id": "69008a3d-3d15-4643-828c-b0419b347d01"
},
"source": [
- "### scikit-learn RunInference pipeline\n",
- "This section demonstrates the following steps:\n",
- "1. Define the scikit-learn model handler that accepts an `array_like`
object as input.\n",
+ "### Create a scikit-learn RunInference pipeline\n",
+ "This section demonstrates how to do the following:\n",
+ "1. Define a scikit-learn model handler that accepts an `array_like`
object as input.\n",
"2. Read the data from BigQuery.\n",
"3. Use the scikit-learn trained model and the scikit-learn
RunInference transform on unkeyed data."
]
@@ -360,8 +360,8 @@
"id": "33e901d6-ed06-4268-8a5f-685d31b5558f"
},
"source": [
- "### Sklearn RunInference on keyed inputs.\n",
- "This section demonstrates the following steps:\n",
+ "### Use sklearn RunInference on keyed inputs\n",
+ "This section demonstrates how to do the following:\n",
"1. Wrap the `SklearnModelHandlerNumpy` object around
`KeyedModelHandler` to handle keyed data.\n",
     "2. Read the data from BigQuery.\n",
     "3. Use the sklearn trained model and the sklearn RunInference
transform on keyed data."
@@ -410,7 +410,7 @@
"source": [
"## Run multiple models\n",
"\n",
- "This pipeline takes two RunInference transforms with different models
and then combines the output."
+ "This code creates a pipeline that takes two RunInference transforms
with different models and then combines the output."
],
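The idea of running two models on the same input and combining the output can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's code: `fit_line` is a hypothetical least-squares helper standing in for a trained scikit-learn model, and `run_both` stands in for the two RunInference branches followed by the combine step.

```python
from typing import Callable, List, Sequence, Tuple


def fit_line(xs: Sequence[float],
             ys: Sequence[float]) -> Callable[[float], float]:
    """Fit y = slope * x + intercept by ordinary least squares.

    Hypothetical stand-in for training a linear regression model.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Training data for the 5 and 10 times tables, as described in the notebook.
xs = list(range(100))
five_times = fit_line(xs, [5 * x for x in xs])
ten_times = fit_line(xs, [10 * x for x in xs])


def run_both(inputs: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Apply both models to every input and combine the two predictions."""
    return [(x, five_times(x), ten_times(x)) for x in inputs]


results = run_both([20, 40, 60])
```

Each result tuple holds the input followed by the prediction from each model, so a downstream step can compare them side by side.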
"metadata": {
"id": "JQ4zvlwsRK1W"
diff --git a/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb
b/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb
index 3e2e9e428ae..81e3bd38cac 100644
--- a/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb
+++ b/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb
@@ -7,8 +7,8 @@
"collapsed_sections": []
},
"kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
+ "name": "python3",
+ "display_name": "Python 3"
},
"language_info": {
"name": "python"
@@ -39,6 +39,7 @@
"# under the License"
],
"metadata": {
+ "cellView": "form",
"id": "fFjof1NgAJwu"
},
"execution_count": null,
@@ -49,11 +50,11 @@
"source": [
"# Apache Beam RunInference with TensorFlow\n",
"This notebook demonstrates the use of the RunInference transform for
[TensorFlow](https://www.tensorflow.org/).\n",
- "Beam
[RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
accepts a ModelHandler generated from
[`tfx-bsl`](https://github.com/tensorflow/tfx-bsl) via CreateModelHandler.\n",
+ "Beam
[RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
accepts a ModelHandler generated from
[`tfx-bsl`](https://github.com/tensorflow/tfx-bsl) using
`CreateModelHandler`.\n",
"\n",
- "The Apache Beam RunInference transform is used for making predictions
for\n",
+ "The Apache Beam RunInference transform is used to make predictions
for\n",
"a variety of machine learning models. In versions 1.10.0 and later of
`tfx-bsl`, you can\n",
- "create a TensorFlow ModelHandler for use with Apache Beam. For more
information about the RunInference API, see [Machine
Learning](https://beam.apache.org/documentation/sdks/python-machine-learning)
in the Apache Beam documentation.\n",
+ "create a TensorFlow `ModelHandler` for use with Apache Beam. For more
information about the RunInference API, see [Machine
Learning](https://beam.apache.org/documentation/sdks/python-machine-learning)
in the Apache Beam documentation.\n",
"\n",
"This notebook demonstrates the following steps:\n",
"- Import [`tfx-bsl`](https://github.com/tensorflow/tfx-bsl).\n",
@@ -68,6 +69,9 @@
{
"cell_type": "markdown",
"source": [
+ "## Before you begin\n",
+ "Complete the following setup steps.\n",
+ "\n",
"First, import `tfx-bsl`."
],
"metadata": {
@@ -123,7 +127,7 @@
{
"cell_type": "markdown",
"source": [
- "## Authenticate with Google Cloud\n",
+ "### Authenticate with Google Cloud\n",
"This notebook relies on saving your model to Google Cloud. To use
your Google Cloud account, authenticate this notebook."
],
"metadata": {
@@ -145,7 +149,7 @@
{
"cell_type": "markdown",
"source": [
- "## Import dependencies and set up your bucket\n",
+ "### Import dependencies and set up your bucket\n",
"Replace `PROJECT_ID` and `BUCKET_NAME` with the ID of your project
and the name of your bucket.\n",
"\n",
"**Important**: If an error occurs, restart your runtime."
@@ -193,12 +197,20 @@
"source": [
"## Create and test a simple model\n",
"\n",
- "This step creates a model that predicts the 5 times table."
+ "This step creates and tests a model that predicts the 5 times table."
],
"metadata": {
"id": "YzvZWEv-1oiK"
}
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Create the model\n",
+ "Create training data and build a linear regression model."
+ ]
+ },
{
"cell_type": "code",
"metadata": {
@@ -296,7 +308,7 @@
"source": [
"### Populate the data in a TensorFlow proto\n",
"\n",
- "Tensorflow data uses protos. If you are loading from a file, helpers
exist for this step. Because we are using generated data, this code populates a
proto."
+        "TensorFlow data uses protos. If you are loading from a file, helpers
exist for this step. Because this example uses generated data, this code
populates a proto."
],
"metadata": {
"id": "dEmleqiH3t71"
@@ -356,7 +368,7 @@
       "source": [
         "### Fit the model\n",
"\n",
- "This example builds a model. Because RunInference requires pretrained
models, this segment builds a usable model."
+ "This step builds a model. Because RunInference requires pretrained
models, this segment builds a usable model."
],
"metadata": {
"id": "G-sAu3cf31f3"
@@ -445,6 +457,7 @@
       "cell_type": "markdown",
       "source": [
         "## Run the pipeline\n",
+ "Use the following code to run the pipeline.\n",
"\n",
"`FormatOutput` demonstrates how to extract values from the output
protos.\n",
"\n",
@@ -507,11 +520,10 @@
"\n",
"By default, the `ModelHandler` does not expect a key.\n",
"\n",
- "If you know that keys are associated with your examples, wrap the
model handler with `beam.KeyedModelHandler`.\n",
- "\n",
- "If you don't know whether keys are associated with your examples, use
`beam.MaybeKeyedModelHandler`.\n",
+ "* If you know that keys are associated with your examples, wrap the
model handler with `beam.KeyedModelHandler`.\n",
+ "* If you don't know whether keys are associated with your examples,
use `beam.MaybeKeyedModelHandler`.\n",
"\n",
- "This step also illustrates how to use `tfx-bsl` examples."
+ "In addition to demonstrating how to use a keyed model handler, this
step demonstrates how to use `tfx-bsl` examples."
],
"metadata": {
"id": "IXikjkGdHm9n"
@@ -583,4 +595,4 @@
]
}
]
-}
+}
\ No newline at end of file
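Several of the notebooks touched by this commit wrap a model handler so that a key travels with each example. The following plain-Python sketch shows that idea in isolation. It makes no claim about how Beam implements it: `run_keyed` and `five_times_model` are hypothetical names, and the model is a trivial function rather than a trained one.

```python
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple


def run_keyed(
    model: Callable[[Sequence[float]], Sequence[float]],
    keyed_examples: Iterable[Tuple[Hashable, float]],
) -> List[Tuple[Hashable, float]]:
    """Run a batch model on keyed examples and reattach the keys.

    The model only ever sees the bare examples; the keys are set aside
    before the call and zipped back onto the predictions afterwards.
    """
    pairs = list(keyed_examples)
    keys = [key for key, _ in pairs]
    predictions = model([example for _, example in pairs])
    return list(zip(keys, predictions))


def five_times_model(batch: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained 5 times table model."""
    return [5.0 * x for x in batch]


keyed_predictions = run_keyed(
    five_times_model, [("first", 1.0), ("second", 2.0)])
```

Because the key is returned next to each prediction, a postprocessing step can tell which input every result belongs to.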