[GitHub] [beam] damccorm commented on a diff in pull request #26404: Add run_inference windowing notebook

via GitHub Mon, 15 May 2023 08:53:44 -0700


damccorm commented on code in PR #26404:
URL: https://github.com/apache/beam/pull/26404#discussion_r1194031724



##########
examples/notebooks/beam-ml/run_inference_windowing.ipynb:
##########
@@ -0,0 +1,416 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "K2MpsIa-ncMZ"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Use windowing with RunInference predictions \n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_windowing.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_windowing.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "fKxfINuCPsh9"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook shows how to use the RunInference transform with 
[windowing](https://beam.apache.org/documentation/programming-guide/#windowing) 
in a streaming pipeline. Windowing is useful when your data arrives within a 
particular timeframe and can be divided by timestamp, or when you want to see 
trends before all the data is processed. In this example, the pipeline predicts 
the quality of milk samples and classifies them as `good`, `bad`, or `medium`. 
The pipeline then aggregates the predictions for each window. To make 
predictions, the pipeline uses the XGBoost model handler. For more information 
about the RunInference API, see the [Machine Learning section of the Apache 
Beam documentation](https://beam.apache.org/documentation/ml/overview/).\n",
+        "\n",
+        "With RunInference, a model handler manages batching, vectorization, 
and prediction optimization for your XGBoost pipeline or model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "- Generate predictions for all samples in a window.\n",
+        "- Aggregate the results per window after running inference.\n",
+        "- Print the aggregations."
+      ],
+      "metadata": {
+        "id": "knGVsVR6P_nZ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Before you begin\n",
+        "Complete the following setup steps:\n",
+        "- Install dependencies for Apache Beam.\n",
+        "- Install XGBoost.\n",
+        "- Download the [Milk Quality Prediction dataset from 
Kaggle](https://www.kaggle.com/datasets/cpluzshrijayan/milkquality). Name the 
dataset `milk_quality.csv`, and put it in the current directory. Use the CSV 
file format for the dataset."
+      ],
+      "metadata": {
+        "id": "s5PPNo9HRRe1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install apache-beam==2.47.0\n",
+        "!pip install xgboost"

Review Comment:
   ```suggestion
           "!pip install xgboost\n",
           "# You may need to install a different version of Datatable directly, depending on your environment\n",
           "!pip install datatable"
   ```
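
For readers following the hunk above: the notebook introduction describes running inference and then aggregating the predictions for each window. The following is a minimal, Beam-free sketch of that idea using only the standard library. The 60-second window size, the function names, and the sample predictions are hypothetical and are not taken from the notebook. In a Beam pipeline the same grouping would typically be expressed with `beam.WindowInto(window.FixedWindows(...))` followed by a combine step.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SIZE_SECONDS = 60  # hypothetical window length, not from the notebook


def window_start(timestamp: float, size: int = WINDOW_SIZE_SECONDS) -> int:
    """Return the start of the fixed window that contains `timestamp`."""
    return int(timestamp // size) * size


def aggregate_per_window(
    predictions: Iterable[Tuple[float, str]],
    size: int = WINDOW_SIZE_SECONDS,
) -> Dict[int, Counter]:
    """Count predicted labels per fixed window.

    `predictions` holds (event_timestamp_seconds, predicted_label) pairs.
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, label in predictions:
        counts[window_start(timestamp, size)][label] += 1
    return dict(counts)


# Made-up sample predictions, for illustration only.
sample = [(5, "good"), (30, "bad"), (61, "good"), (119, "good"), (120, "medium")]
for start, labels in sorted(aggregate_per_window(sample).items()):
    print(f"[{start}, {start + WINDOW_SIZE_SECONDS}): {dict(labels)}")
```

Each element lands in exactly one window because fixed windows do not overlap, which is why a plain dictionary keyed by window start is enough for this sketch.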



##########
examples/notebooks/beam-ml/run_inference_windowing.ipynb:
##########
@@ -0,0 +1,416 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "K2MpsIa-ncMZ"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Use windowing with RunInference predictions \n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_windowing.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_windowing.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "fKxfINuCPsh9"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook shows how to use the RunInference transform with 
[windowing](https://beam.apache.org/documentation/programming-guide/#windowing) 
in a streaming pipeline. Windowing is useful when your data arrives within a 
particular timeframe and can be divided by timestamp, or when you want to see 
trends before all the data is processed. In this example, the pipeline predicts 
the quality of milk samples and classifies them as `good`, `bad`, or `medium`. 
The pipeline then aggregates the predictions for each window. To make 
predictions, the pipeline uses the XGBoost model handler. For more information 
about the RunInference API, see the [Machine Learning section of the Apache 
Beam documentation](https://beam.apache.org/documentation/ml/overview/).\n",
+        "\n",
+        "With RunInference, a model handler manages batching, vectorization, 
and prediction optimization for your XGBoost pipeline or model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "- Generate predictions for all samples in a window.\n",
+        "- Aggregate the results per window after running inference.\n",
+        "- Print the aggregations."
+      ],
+      "metadata": {
+        "id": "knGVsVR6P_nZ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Before you begin\n",
+        "Complete the following setup steps:\n",
+        "- Install dependencies for Apache Beam.\n",
+        "- Install XGBoost.\n",
+        "- Download the [Milk Quality Prediction dataset from 
Kaggle](https://www.kaggle.com/datasets/cpluzshrijayan/milkquality). Name the 
dataset `milk_quality.csv`, and put it in the current directory. Use the CSV 
file format for the dataset."
+      ],
+      "metadata": {
+        "id": "s5PPNo9HRRe1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install apache-beam==2.47.0\n",
+        "!pip install xgboost"
+      ],
+      "metadata": {
+        "id": "YiPD9-j_RRNC"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## About the dataset\n",
+        "\n",
+        "This dataset is a CSV file that contains seven columns: `pH`, 
`temperature`, `taste`, `odor`, `fat`, `turbidity`, and `color`. The dataset 
also contains a column that labels the quality of each sample as `good`, `bad`, 
or `medium`."
+      ],
+      "metadata": {
+        "id": "Uz9BcQg_Qbva"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import argparse\n",
+        "import logging\n",
+        "import time\n",
+        "from typing import NamedTuple\n",
+        "\n",
+        "import pandas\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "\n",
+        "import apache_beam as beam\n",
+        "import xgboost\n",
+        "from apache_beam import window\n",
+        "from apache_beam.ml.inference import RunInference\n",
+        "from apache_beam.ml.inference.base import PredictionResult\n",
+        "from apache_beam.ml.inference.xgboost_inference import 
XGBoostModelHandlerPandas\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions\n",
+        "from apache_beam.options.pipeline_options import SetupOptions\n",
+        "from apache_beam.runners.runner import PipelineResult\n",
+        "from apache_beam.testing.test_stream import TestStream"
+      ],
+      "metadata": {
+        "id": "sHDrJ1nTPqUv"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the dataset and train the XGBoost model\n",
+        "This section demonstrates the following steps:\n",
+        "1. Load the Milk Quality Prediction dataset from Kaggle.\n",
+        "2. Split the data into a training set and a test set.\n",
+        "3. Train the XGBoost classifier to predict the quality of milk.\n",
+        "4. Save the model in a JSON file using `model.save_model`. For more 
information, see [Introduction to Model 
IO](https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html) in 
the XGBoost documentation.\n"
+      ],
+      "metadata": {
+        "id": "kpXjNoVgRpOb"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "DATASET = \"dataset.csv\"\n",

Review Comment:
   ```suggestion
           "# Replace with the path to milk_quality.csv\n",
           "DATASET = \"milk_quality.csv\"\n\n",
   ```
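
One possible way to make a wrong or missing path fail early, sketched with the standard library only. The helper name `resolve_dataset` and the error wording are hypothetical; only the file name `milk_quality.csv` comes from the notebook's setup instructions.

```python
from pathlib import Path

DATASET = "milk_quality.csv"  # file name from the notebook's setup steps


def resolve_dataset(path: str = DATASET) -> Path:
    """Return the dataset path, or raise a descriptive error if it is missing."""
    dataset_path = Path(path)
    if not dataset_path.is_file():
        raise FileNotFoundError(
            f"{dataset_path} not found. Download the Milk Quality Prediction "
            "dataset from Kaggle and save it under this name."
        )
    return dataset_path
```

A check like this surfaces the problem before any training or pipeline code runs, instead of as a less obvious error from the CSV reader.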



##########
examples/notebooks/beam-ml/run_inference_windowing.ipynb:
##########
@@ -0,0 +1,416 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "K2MpsIa-ncMZ"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Use windowing with RunInference predictions \n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_windowing.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_windowing.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "fKxfINuCPsh9"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook shows how to use the RunInference transform with 
[windowing](https://beam.apache.org/documentation/programming-guide/#windowing) 
in a streaming pipeline. Windowing is useful when your data arrives within a 
particular timeframe and can be divided by timestamp, or when you want to see 
trends before all the data is processed. In this example, the pipeline predicts 
the quality of milk samples and classifies them as `good`, `bad`, or `medium`. 
The pipeline then aggregates the predictions for each window. To make 
predictions, the pipeline uses the XGBoost model handler. For more information 
about the RunInference API, see the [Machine Learning section of the Apache 
Beam documentation](https://beam.apache.org/documentation/ml/overview/).\n",
+        "\n",
+        "With RunInference, a model handler manages batching, vectorization, 
and prediction optimization for your XGBoost pipeline or model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "- Generate predictions for all samples in a window.\n",
+        "- Aggregate the results per window after running inference.\n",
+        "- Print the aggregations."
+      ],
+      "metadata": {
+        "id": "knGVsVR6P_nZ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Before you begin\n",
+        "Complete the following setup steps:\n",
+        "- Install dependencies for Apache Beam.\n",
+        "- Install XGBoost.\n",
+        "- Download the [Milk Quality Prediction dataset from 
Kaggle](https://www.kaggle.com/datasets/cpluzshrijayan/milkquality). Name the 
dataset `milk_quality.csv`, and put it in the current directory. Use the CSV 
file format for the dataset."

Review Comment:
   ```suggestion
           "- Download the [Milk Quality Prediction dataset from 
Kaggle](https://www.kaggle.com/datasets/cpluzshrijayan/milkquality). Name the 
dataset `milk_quality.csv`, and put it in the current directory. Use the CSV 
file format for the dataset. If you use Colab, you also need to [upload the 
file to the Colab filesystem](https://neptune.ai/blog/google-colab-dealing-with-files)."
   ```
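
A rough sketch of what the upload step could look like in a notebook cell, assuming the `google.colab` package that Colab runtimes provide. The helper names are hypothetical. Outside Colab the function does nothing, so the same cell can run locally.

```python
import importlib.util


def in_colab() -> bool:
    """Return True when the google.colab package is importable."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # The parent `google` package is missing or has no usable spec.
        return False


def upload_dataset_if_in_colab() -> list:
    """Open Colab's upload dialog and return the uploaded file names."""
    if not in_colab():
        return []
    from google.colab import files  # only available inside Colab

    uploaded = files.upload()  # dict mapping file name to file contents
    return list(uploaded)
```

After the upload, the file should be available in the runtime's working directory, so the file name still has to match the one the notebook expects (`milk_quality.csv`).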



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

