[GitHub] [beam] damccorm commented on a diff in pull request #25904: Add XGBoost example notebook

via GitHub Mon, 27 Mar 2023 09:15:51 -0700


damccorm commented on code in PR #25904:
URL: https://github.com/apache/beam/pull/25904#discussion_r1149469575



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",

Review Comment:
   ```suggestion
           "This notebook demonstrates the use of the RunInference transform for XGBoost. Apache Beam RunInference has implementations of the ModelHandler class prebuilt for XGBoost. For more information about the RunInference API, see the [Machine Learning section of the Apache Beam documentation](https://beam.apache.org/documentation/ml/overview/).\n",
   ```



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"
+      ],
+      "metadata": {
+        "id": "6nh2h-sIOAOg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before you begin\n",
+        "Complete the following setup steps:\n",

Review Comment:
   ```suggestion
           "Before you begin, complete the following setup steps:\n",
   ```
   
   Not sure why, but this renders incorrectly in Colab and GitHub (the trailing
`\n` isn't respected).
   
   <img width="406" alt="image" src="https://user-images.githubusercontent.com/42773683/227998497-36143a83-85f6-4139-9ad0-a7ae1a806b30.png">
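
A likely cause, offered as a guess rather than a confirmed diagnosis: in Markdown, a single newline inside a paragraph is a soft line break, so `Before you begin\n` followed directly by the next line is rendered as one paragraph. If the two lines are meant to stay separate, making the first line a heading or adding an empty `"\n"` entry between them should both work. The sketch below builds both variants as nbformat 4 cell dictionaries. The helper name `markdown_cell` and the `##` heading level are assumptions, not part of the PR.

```python
import json


def markdown_cell(lines):
    """Build an nbformat-4 style markdown cell dict from a list of source lines.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    return {"cell_type": "markdown", "metadata": {}, "source": lines}


# Variant 1: promote the first line to a heading, so the renderer
# treats it as its own block regardless of the following line.
heading_cell = markdown_cell([
    "## Before you begin\n",
    "Complete the following setup steps:\n",
    "\n",
    "- Install dependencies for Apache Beam.",
])

# Variant 2: keep plain text, but separate the two lines with a blank
# line so that Markdown starts a new paragraph.
paragraph_cell = markdown_cell([
    "Before you begin\n",
    "\n",
    "Complete the following setup steps:\n",
    "\n",
    "- Install dependencies for Apache Beam.",
])

# Show the JSON exactly as it would appear inside the .ipynb file.
print(json.dumps(heading_cell, indent=2))
```

The suggestion above (merging the two lines into one sentence) sidesteps the problem entirely; the variants here only matter if "Before you begin" should remain a separate title.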
   



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"
+      ],
+      "metadata": {
+        "id": "6nh2h-sIOAOg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before you begin\n",
+        "Complete the following setup steps:\n",
+        "\n",
+        "- Install dependencies for Apache Beam."
+      ],
+      "metadata": {
+        "id": "nRCJBcTUOq1k"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install apache-beam[gcp,dataframe] --quiet"
+      ],
+      "metadata": {
+        "id": "gbmH329jOuj1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import xgboost\n",
+        "import apache_beam as beam\n",
+        "from sklearn.datasets import fetch_california_housing\n",
+        "from sklearn.datasets import load_iris\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "\n",
+        "from apache_beam.ml.inference import RunInference\n",
+        "from apache_beam.ml.inference.xgboost_inference import 
XGBoostModelHandlerNumpy\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions"
+      ],
+      "metadata": {
+        "id": "_O0BN_XqOwp1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "SEED = 999\n",
+        "CLASSIFICATION_MODEL_STATE = '/tmp/classification_model.json'\n",
+        "REGRESSION_MODEL_STATE = '/tmp/regression_model.json'"
+      ],
+      "metadata": {
+        "id": "ue_5a-oaO-Lz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the data from scikit-learn and train XGBoost models\n",
+        "This section demonstrates the following steps:\n",
+        "1. Load the iris and Califorina Housing datasets from scikit-learn 
and create a classification and regression model.\n",
+        "2. Train the classification and regression model.\n",
+        "3. Save the models in a JSON file using `mode.save_model`. 
(https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html)\n",
+        "\n",
+        "In this example, you create two models, one to classify Iris flowers 
and one to predict housing prices in California."
+      ],
+      "metadata": {
+        "id": "74oE5pGgPE0M"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Train the classification model\n",
+        "iris_dataset = load_iris()\n",
+        "x_train_classification, x_test_classification, 
y_train_classification, y_test_classification = train_test_split(\n",
+        "    iris_dataset['data'], iris_dataset['target'], test_size=.2, 
random_state=SEED)\n",
+        "booster = xgboost.XGBClassifier(\n",
+        "    n_estimators=2, max_depth=2, learning_rate=1, 
objective='binary:logistic')\n",
+        "booster.fit(x_train_classification, y_train_classification)\n",
+        "booster.save_model(CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "# Train the regression model\n",
+        "california_dataset = fetch_california_housing()\n",
+        "x_train_regression, x_test_regression, y_train_regression, 
y_test_regression = train_test_split(\n",
+        "    california_dataset['data'], california_dataset['target'], 
test_size=.2, random_state=SEED)\n",
+        "model = xgboost.XGBRegressor(\n",
+        "    n_estimators=1000,\n",
+        "    max_depth=8,\n",
+        "    eta=0.1,\n",
+        "    subsample=0.75,\n",
+        "    colsample_bytree=0.8)\n",
+        "model.fit(x_train_regression, y_train_regression)\n",
+        "model.save_model(REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "\n",
+        "# Reshape the test data as XGBoost expects a batch instead of a 
single element\n",
+        "# More information: https://xgboost.readthedocs.io/en/stable/prediction.html\n",
+        "x_test_classification = x_test.reshape(5, 6, 4)\n",
+        "x_test_regression = x_test_regression.reshape(258, 16, 8)"
+      ],
+      "metadata": {
+        "id": "KVSKt3pFPBnj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Create a scikit-learn RunInference pipeline\n",
+        "This section demonstrates how to do the following:\n",
+        "1. Define a XGBoost model handler that accepts an `numpy.ndarray` 
object as input.\n",
+        "2. Load the data from the datasets.\n",
+        "3. Use the XGBoost trained models and the XGBoost RunInference 
transform on unkeyed data."
+      ],
+      "metadata": {
+        "id": "ItuxdQoXSNTQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_classification_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBClassifier, 
model_state=CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_classification)\n",
+        "      | \"RunInferenceXGBoost\" >>\n",
+        "      
RunInference(model_handler=xgboost_classification_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "SBdMq3-CSGqZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_regression_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBRegressor, 
model_state=REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_regression)\n",
+        "      | \"RunInferenceSklearn\" >>\n",

Review Comment:
   ```suggestion
           "      | \"RunInferenceXGBoost\" >>\n",
   ```



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"
+      ],
+      "metadata": {
+        "id": "6nh2h-sIOAOg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before you begin\n",
+        "Complete the following setup steps:\n",
+        "\n",
+        "- Install dependencies for Apache Beam."
+      ],
+      "metadata": {
+        "id": "nRCJBcTUOq1k"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install apache-beam[gcp,dataframe] --quiet"
+      ],
+      "metadata": {
+        "id": "gbmH329jOuj1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import xgboost\n",
+        "import apache_beam as beam\n",
+        "from sklearn.datasets import fetch_california_housing\n",
+        "from sklearn.datasets import load_iris\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "\n",
+        "from apache_beam.ml.inference import RunInference\n",
+        "from apache_beam.ml.inference.xgboost_inference import 
XGBoostModelHandlerNumpy\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions"
+      ],
+      "metadata": {
+        "id": "_O0BN_XqOwp1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "SEED = 999\n",
+        "CLASSIFICATION_MODEL_STATE = '/tmp/classification_model.json'\n",
+        "REGRESSION_MODEL_STATE = '/tmp/regression_model.json'"
+      ],
+      "metadata": {
+        "id": "ue_5a-oaO-Lz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the data from scikit-learn and train XGBoost models\n",
+        "This section demonstrates the following steps:\n",
+        "1. Load the iris and Califorina Housing datasets from scikit-learn 
and create a classification and regression model.\n",
+        "2. Train the classification and regression model.\n",
+        "3. Save the models in a JSON file using `mode.save_model`. 
(https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html)\n",
+        "\n",
+        "In this example, you create two models, one to classify Iris flowers 
and one to predict housing prices in California."
+      ],
+      "metadata": {
+        "id": "74oE5pGgPE0M"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Train the classification model\n",
+        "iris_dataset = load_iris()\n",
+        "x_train_classification, x_test_classification, 
y_train_classification, y_test_classification = train_test_split(\n",
+        "    iris_dataset['data'], iris_dataset['target'], test_size=.2, 
random_state=SEED)\n",
+        "booster = xgboost.XGBClassifier(\n",
+        "    n_estimators=2, max_depth=2, learning_rate=1, 
objective='binary:logistic')\n",
+        "booster.fit(x_train_classification, y_train_classification)\n",
+        "booster.save_model(CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "# Train the regression model\n",
+        "california_dataset = fetch_california_housing()\n",
+        "x_train_regression, x_test_regression, y_train_regression, 
y_test_regression = train_test_split(\n",
+        "    california_dataset['data'], california_dataset['target'], 
test_size=.2, random_state=SEED)\n",
+        "model = xgboost.XGBRegressor(\n",
+        "    n_estimators=1000,\n",
+        "    max_depth=8,\n",
+        "    eta=0.1,\n",
+        "    subsample=0.75,\n",
+        "    colsample_bytree=0.8)\n",
+        "model.fit(x_train_regression, y_train_regression)\n",
+        "model.save_model(REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "\n",
+        "# Reshape the test data as XGBoost expects a batch instead of a 
single element\n",
+        "# More information: https://xgboost.readthedocs.io/en/stable/prediction.html\n",
+        "x_test_classification = x_test.reshape(5, 6, 4)\n",
+        "x_test_regression = x_test_regression.reshape(258, 16, 8)"
+      ],
+      "metadata": {
+        "id": "KVSKt3pFPBnj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Create a scikit-learn RunInference pipeline\n",
+        "This section demonstrates how to do the following:\n",
+        "1. Define a XGBoost model handler that accepts an `numpy.ndarray` 
object as input.\n",
+        "2. Load the data from the datasets.\n",
+        "3. Use the XGBoost trained models and the XGBoost RunInference 
transform on unkeyed data."
+      ],
+      "metadata": {
+        "id": "ItuxdQoXSNTQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_classification_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBClassifier, 
model_state=CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_classification)\n",
+        "      | \"RunInferenceXGBoost\" >>\n",
+        "      
RunInference(model_handler=xgboost_classification_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "SBdMq3-CSGqZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_regression_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBRegressor, 
model_state=REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_regression)\n",
+        "      | \"RunInferenceSklearn\" >>\n",
+        "      RunInference(model_handler=xgboost_regression_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "IYUXIJt7UIm6"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Use XGBoost RunInference on keyed inputs\n",
+        "This section demonstrates how to do the following:\n",
+        "1. Wrap the `XGBoostHandlerNumpy` object around `KeyedModelHandler` 
to handle keyed data.\n",

Review Comment:
   ```suggestion
           "1. Wrap the `XGBoostHandlerNumpy` with a `KeyedModelHandler` to handle keyed data.\n",
   ```
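
For readers unfamiliar with the keyed pattern: a keyed model handler accepts `(key, example)` pairs and returns `(key, prediction)` pairs, so each result can be traced back to its input. The sketch below illustrates that idea in plain Python. It is a conceptual stand-in, not Beam's implementation, and the names `run_keyed_inference` and `toy_predict` are hypothetical.

```python
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
X = TypeVar("X")
Y = TypeVar("Y")


def run_keyed_inference(
    keyed_batch: Sequence[Tuple[K, X]],
    predict_batch: Callable[[List[X]], Iterable[Y]],
) -> List[Tuple[K, Y]]:
    """Strip the keys, run one batched prediction, then re-attach keys in order."""
    keys = [key for key, _ in keyed_batch]
    examples = [example for _, example in keyed_batch]
    predictions = list(predict_batch(examples))
    if len(predictions) != len(keys):
        raise ValueError("predict_batch must return one prediction per example")
    return list(zip(keys, predictions))


def toy_predict(batch):
    """Hypothetical stand-in for a model: 'predicts' the sum of each feature list."""
    return [sum(features) for features in batch]


keyed_inputs = [("flower-a", [1, 2]), ("flower-b", [3, 4])]
keyed_outputs = run_keyed_inference(keyed_inputs, toy_predict)
print(keyed_outputs)
```

The wrapped handler never sees the keys, which is why the wording "wrap ... with a `KeyedModelHandler`" describes the relationship more accurately than wrapping the handler "around" it.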



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"

Review Comment:
   Actually, it looks like we don't do any postprocessing anyway (which is
   fine; printing is enough). So maybe we can cut this list and just say `This
   notebook uses RunInference to perform classification of Iris flowers and
   regression to predict housing prices`.



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"
+      ],
+      "metadata": {
+        "id": "6nh2h-sIOAOg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before you begin\n",
+        "Complete the following setup steps:\n",
+        "\n",
+        "- Install dependencies for Apache Beam."
+      ],
+      "metadata": {
+        "id": "nRCJBcTUOq1k"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install apache-beam[gcp,dataframe] --quiet"
+      ],
+      "metadata": {
+        "id": "gbmH329jOuj1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import xgboost\n",
+        "import apache_beam as beam\n",
+        "from sklearn.datasets import fetch_california_housing\n",
+        "from sklearn.datasets import load_iris\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "\n",
+        "from apache_beam.ml.inference import RunInference\n",
+        "from apache_beam.ml.inference.xgboost_inference import 
XGBoostModelHandlerNumpy\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions"
+      ],
+      "metadata": {
+        "id": "_O0BN_XqOwp1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "SEED = 999\n",
+        "CLASSIFICATION_MODEL_STATE = '/tmp/classification_model.json'\n",
+        "REGRESSION_MODEL_STATE = '/tmp/regression_model.json'"
+      ],
+      "metadata": {
+        "id": "ue_5a-oaO-Lz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the data from scikit-learn and train XGBoost models\n",
+        "This section demonstrates the following steps:\n",
+        "1. Load the iris and Califorina Housing datasets from scikit-learn 
and create a classification and regression model.\n",
+        "2. Train the classification and regression model.\n",
+        "3. Save the models in a JSON file using `mode.save_model`. 
(https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html)\n",
+        "\n",
+        "In this example, you create two models, one to classify Iris flowers 
and one to predict housing prices in California."
+      ],
+      "metadata": {
+        "id": "74oE5pGgPE0M"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Train the classification model\n",
+        "iris_dataset = load_iris()\n",
+        "x_train_classification, x_test_classification, 
y_train_classification, y_test_classification = train_test_split(\n",
+        "    iris_dataset['data'], iris_dataset['target'], test_size=.2, 
random_state=SEED)\n",
+        "booster = xgboost.XGBClassifier(\n",
+        "    n_estimators=2, max_depth=2, learning_rate=1, 
objective='binary:logistic')\n",
+        "booster.fit(x_train_classification, y_train_classification)\n",
+        "booster.save_model(CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "# Train the regression model\n",
+        "california_dataset = fetch_california_housing()\n",
+        "x_train_regression, x_test_regression, y_train_regression, 
y_test_regression = train_test_split(\n",
+        "    california_dataset['data'], california_dataset['target'], 
test_size=.2, random_state=SEED)\n",
+        "model = xgboost.XGBRegressor(\n",
+        "    n_estimators=1000,\n",
+        "    max_depth=8,\n",
+        "    eta=0.1,\n",
+        "    subsample=0.75,\n",
+        "    colsample_bytree=0.8)\n",
+        "model.fit(x_train_regression, y_train_regression)\n",
+        "model.save_model(REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "\n",
+        "# Reshape the test data as XGBoost expects a batch instead of a 
single element\n",
+        "# More information: 
https://xgboost.readthedocs.io/en/stable/prediction.html\n",
+        "x_test_classification = x_test.reshape(5, 6, 4)\n",
+        "x_test_regression = x_test_regression.reshape(258, 16, 8)"
+      ],
+      "metadata": {
+        "id": "KVSKt3pFPBnj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Create a scikit-learn RunInference pipeline\n",
+        "This section demonstrates how to do the following:\n",
+        "1. Define a XGBoost model handler that accepts an `numpy.ndarray` 
object as input.\n",
+        "2. Load the data from the datasets.\n",
+        "3. Use the XGBoost trained models and the XGBoost RunInference 
transform on unkeyed data."
+      ],
+      "metadata": {
+        "id": "ItuxdQoXSNTQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_classification_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBClassifier, 
model_state=CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_classification)\n",
+        "      | \"RunInferenceXGBoost\" >>\n",
+        "      
RunInference(model_handler=xgboost_classification_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "SBdMq3-CSGqZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_regression_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBRegressor, 
model_state=REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_regression)\n",
+        "      | \"RunInferenceSklearn\" >>\n",
+        "      RunInference(model_handler=xgboost_regression_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "IYUXIJt7UIm6"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Use XGBoost RunInference on keyed inputs\n",
+        "This section demonstrates how to do the following:\n",
+        "1. Wrap the `XGBoostHandlerNumpy` object around `KeyedModelHandler` 
to handle keyed data.\n",
+        "2. Load the data from the datasets.\n",
+        "3. Use the XGBoost trained models and the XGBoost RunInference 
transform on the keyed data."
+      ],
+      "metadata": {
+        "id": "ptTZUGmqW4s2"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "x_test_classification = [(f'batch {i}', sample) for i, sample in 
enumerate(x_test_classification)]\n",
+        "x_test_regression = [(f'batch {i}', sample for i, sample in 
enumerate(x_test_regression)]"
+      ],
+      "metadata": {
+        "id": "MBSbY569W3zm"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "keyed_xgboost_regression_model_handler = 
KeyedModelHandler(xgboost_classification_model_handler)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_classification)\n",
+        "      | \"RunInferenceXGBoost\" >>\n",
+        "      
RunInference(model_handler=keyed_xgboost_regression_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "8L7sU7a5YXrI"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "keyed_xgboost_regression_model_handler = 
KeyedModelHandler(xgboost_regression_model_handler)\n",
+        "\n",
+        "\n",

Review Comment:
   ```suggestion
           "keyed_xgboost_regression_model_handler = 
KeyedModelHandler(xgboost_regression_model_handler)\n",
           "\n",
   ```
   
   Nit: the spacing here is inconsistent with the other examples.
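
For reference, the keyed cells quoted above come down to pairing each batch with a string key before it enters the pipeline. The sketch below shows only that pairing step in plain Python, using made-up stand-in batches rather than the notebook's NumPy arrays (the values and the `batches` name are illustrative assumptions, not taken from the PR). Two things stand out when reading the quoted diff, though neither has been run here: the `x_test_regression` comprehension appears to be missing the closing parenthesis after `sample`, and `KeyedModelHandler` is not in the quoted import cell (it should be importable from `apache_beam.ml.inference.base`).

```python
# Stand-in data (assumption): three batches, each holding two rows of four
# features, loosely mimicking the shape of the reshaped Iris test set.
batches = [
    [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
    [[6.2, 2.9, 4.3, 1.3], [5.9, 3.0, 4.2, 1.5]],
    [[6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9]],
]

# Pair every batch with a key. Each tuple is closed before the `for` clause,
# which is the parenthesis the quoted regression line seems to lack.
keyed_batches = [(f'batch {i}', batch) for i, batch in enumerate(batches)]

for key, batch in keyed_batches:
    print(key, len(batch))
```

A keyed model handler would then be expected to emit each key alongside the prediction for its batch, which is what makes the keys useful for joining results back to their inputs.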



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Create a scikit-learn RunInference pipeline\n",

Review Comment:
   ```suggestion
           "### Create an XGBoost RunInference pipeline\n",
   ```



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"

Review Comment:
   ```suggestion
           "- Generate predictions\n",
           "- Postprocess results after RunInference\n",
           "- One model to showcase classification of Iris flowers\n",
           "- One regression model to showcase prediction of housing prices"
   ```
   
   Nit: I think this reads cleaner with bullet points (and will render 
correctly with the `\n` problem mentioned later)
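
To illustrate why the bullets matter: a notebook stores a Markdown cell as a list of strings that are concatenated into the cell text, and Markdown treats a single newline inside a paragraph as a soft break, so the original three lines would render as one run-on paragraph. The sketch below joins both versions of the `source` list and counts the list items a renderer would see. The helper names `cell_text` and `list_items` are made up for illustration, and the bullet detection is deliberately simplistic (it only looks for a leading `- `); this is my reading of the rendering issue, not a statement of how any particular renderer is implemented.

```python
# Original cell lines, as quoted in the diff (no list markers).
source_before = [
    "Generate predictions.\n",
    "Postprocess results after RunInference.\n",
    "One model to showcase classification of Iris flowers and one "
    "regression model to showcase prediction of housing prices",
]

# Suggested replacement: one bullet per line.
source_after = [
    "- Generate predictions\n",
    "- Postprocess results after RunInference\n",
    "- One model to showcase classification of Iris flowers\n",
    "- One regression model to showcase prediction of housing prices",
]


def cell_text(source):
    """Concatenate the JSON `source` strings into the Markdown cell text."""
    return "".join(source)


def list_items(text):
    """Return the lines that start with a '- ' bullet (simplified detection)."""
    return [line for line in text.splitlines() if line.startswith("- ")]


print(len(list_items(cell_text(source_before))))  # 0: one plain paragraph
print(len(list_items(cell_text(source_after))))   # 4: four list items
```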



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,321 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\";
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\";
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for 
XGBoost. Apache Beam RunInference has implementations of the ModelHandler class 
prebuilt for XGBoost. For more information about the RunInference API, see 
Machine Learning in the Apache Beam documentation.\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data 
type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, 
vectorization, and prediction optimization for your XGBoost pipeline or 
model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference 
patterns:\n",
+        "\n",
+        "Generate predictions.\n",
+        "Postprocess results after RunInference.\n",
+        "One model to showcase classification of Iris flowers and one 
regression model to showcase prediction of housing prices"
+      ],
+      "metadata": {
+        "id": "6nh2h-sIOAOg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before you begin\n",
+        "Complete the following setup steps:\n",
+        "\n",
+        "- Install dependencies for Apache Beam."
+      ],
+      "metadata": {
+        "id": "nRCJBcTUOq1k"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install apache-beam[gcp,dataframe] --quiet"
+      ],
+      "metadata": {
+        "id": "gbmH329jOuj1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import xgboost\n",
+        "import apache_beam as beam\n",
+        "from sklearn.datasets import fetch_california_housing\n",
+        "from sklearn.datasets import load_iris\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "\n",
+        "from apache_beam.ml.inference import RunInference\n",
+        "from apache_beam.ml.inference.xgboost_inference import 
XGBoostModelHandlerNumpy\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions"
+      ],
+      "metadata": {
+        "id": "_O0BN_XqOwp1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "SEED = 999\n",
+        "CLASSIFICATION_MODEL_STATE = '/tmp/classification_model.json'\n",
+        "REGRESSION_MODEL_STATE = '/tmp/regression_model.json'"
+      ],
+      "metadata": {
+        "id": "ue_5a-oaO-Lz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the data from scikit-learn and train XGBoost models\n",
+        "This section demonstrates the following steps:\n",
+        "1. Load the iris and Califorina Housing datasets from scikit-learn 
and create a classification and regression model.\n",
+        "2. Train the classification and regression model.\n",
+        "3. Save the models in a JSON file using `mode.save_model`. 
(https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html)\n",
+        "\n",
+        "In this example, you create two models, one to classify Iris flowers 
and one to predict housing prices in California."
+      ],
+      "metadata": {
+        "id": "74oE5pGgPE0M"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Train the classification model\n",
+        "iris_dataset = load_iris()\n",
+        "x_train_classification, x_test_classification, 
y_train_classification, y_test_classification = train_test_split(\n",
+        "    iris_dataset['data'], iris_dataset['target'], test_size=.2, 
random_state=SEED)\n",
+        "booster = xgboost.XGBClassifier(\n",
+        "    n_estimators=2, max_depth=2, learning_rate=1, 
objective='binary:logistic')\n",
+        "booster.fit(x_train_classification, y_train_classification)\n",
+        "booster.save_model(CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "# Train the regression model\n",
+        "california_dataset = fetch_california_housing()\n",
+        "x_train_regression, x_test_regression, y_train_regression, y_test_regression = train_test_split(\n",
+        "    california_dataset['data'], california_dataset['target'], test_size=.2, random_state=SEED)\n",
+        "model = xgboost.XGBRegressor(\n",
+        "    n_estimators=1000,\n",
+        "    max_depth=8,\n",
+        "    eta=0.1,\n",
+        "    subsample=0.75,\n",
+        "    colsample_bytree=0.8)\n",
+        "model.fit(x_train_regression, y_train_regression)\n",
+        "model.save_model(REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "\n",
+        "# Reshape the test data because XGBoost expects a batch instead of a single element\n",
+        "# More information: https://xgboost.readthedocs.io/en/stable/prediction.html\n",
+        "x_test_classification = x_test_classification.reshape(5, 6, 4)\n",
+        "x_test_regression = x_test_regression.reshape(258, 16, 8)"
+      ],
+      "metadata": {
+        "id": "KVSKt3pFPBnj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Create an XGBoost RunInference pipeline\n",
+        "This section demonstrates how to do the following:\n",
+        "1. Define an XGBoost model handler that accepts a `numpy.ndarray` object as input.\n",
+        "2. Load the data from the datasets.\n",
+        "3. Use the trained XGBoost models and the XGBoost RunInference transform on unkeyed data."
+      ],
+      "metadata": {
+        "id": "ItuxdQoXSNTQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_classification_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBClassifier, model_state=CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_classification)\n",
+        "      | \"RunInferenceXGBoost\" >>\n",
+        "      RunInference(model_handler=xgboost_classification_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "SBdMq3-CSGqZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "xgboost_regression_model_handler = XGBoostModelHandlerNumpy(\n",
+        "    model_class=xgboost.XGBRegressor, model_state=REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "pipeline_options = PipelineOptions().from_dictionary({})\n",
+        "\n",
+        "with beam.Pipeline(options=pipeline_options) as p:\n",
+        "  (\n",
+        "      p\n",
+        "      | \"Load Data\" >> beam.Create(x_test_regression)\n",
+        "      | \"RunInferenceXGBoost\" >>\n",
+        "      RunInference(model_handler=xgboost_regression_model_handler)\n",
+        "      | beam.Map(print))"
+      ],
+      "metadata": {
+        "id": "IYUXIJt7UIm6"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Use XGBoost RunInference on keyed inputs\n",

Review Comment:
   Could you add a sentence motivating this section? Something like: 
   ```
   It is often useful to associate examples with a key before doing inference
   so that you can retain metadata about the example (e.g. the original URL of a
   preprocessed image or a non-preprocessed input). RunInference allows you to do
   this using a `KeyedModelHandler`. This section demonstrates how to do the
   following with a `KeyedModelHandler`:
   ...
   ```
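
To make the motivation concrete, here is a dependency-free sketch of what keyed inference does conceptually: the key is set aside, the model sees only the examples, and each key is re-attached to its prediction. The names `keyed_inference` and `fake_predict` are hypothetical and exist only for illustration; they are not Beam APIs. In the notebook itself, the equivalent would presumably be wrapping the XGBoost model handler in a `KeyedModelHandler` and feeding `(key, example)` tuples into `RunInference`.

```python
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple

Example = Sequence[float]


def fake_predict(batch: List[Example]) -> List[int]:
    """Hypothetical stand-in for a model: returns the index of the largest feature."""
    return [max(range(len(x)), key=lambda i: x[i]) for x in batch]


def keyed_inference(
    keyed_batch: Iterable[Tuple[Hashable, Example]],
    predict: Callable[[List[Example]], List[int]],
) -> List[Tuple[Hashable, int]]:
    """Run `predict` on the examples only, then re-attach each key to its result."""
    keys, examples = [], []
    for key, example in keyed_batch:
        keys.append(key)
        examples.append(example)
    predictions = predict(examples)
    return list(zip(keys, predictions))


# The keys carry metadata (here, a made-up identifier) through inference.
keyed_inputs = [
    ("flower-0", [5.1, 3.5, 1.4, 0.2]),
    ("flower-1", [0.3, 0.1, 4.7, 1.4]),
]
results = keyed_inference(keyed_inputs, fake_predict)
print(results)
```

The point of the sketch is only that the model never sees the key, so any hashable metadata can ride along without affecting the prediction.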



