[GitHub] [beam] AnandInguva commented on a diff in pull request #27402: MLTransform basic notebook

via GitHub Mon, 11 Sep 2023 14:05:26 -0700


AnandInguva commented on code in PR #27402:
URL: https://github.com/apache/beam/pull/27402#discussion_r1322073355



##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ],
+      "metadata": {
+        "id": "34gTXZ7BIArp"
+      },
+      "id": "34gTXZ7BIArp",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# MLTransform\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "0n0YAd-0KQyi"
+      },
+      "id": "0n0YAd-0KQyi"
+    },
+    {
+      "cell_type": "markdown",
+      "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+      "metadata": {
+        "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+      },
+      "source": [
+        "This notebook demonstrates how to use `MLTransform` to preprocess your data for machine learning models. `MLTransform` is a `PTransform` that wraps multiple Apache Beam data processing transforms in one transform. As a result, `MLTransform` gives you the ability to preprocess different types of data in multiple ways with one transform.\n",
+        "\n",
+        "This notebook uses data processing transforms defined in the [apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html) module. For a full list of available transforms, see https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+      "metadata": {
+        "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+      },
+      "source": [
+        "## Import the required modules.\n",
+        "\n",
+        "To use `MLTransform`, install `tensorflow_transform` and the Apache Beam SDK version 2.50.0 or later.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install tensorflow_transform --quiet\n",
+        "!pip install 'apache_beam>=2.50.0' --quiet"
+      ],
+      "metadata": {
+        "id": "MRWkC-n2DmjM"
+      },
+      "id": "MRWkC-n2DmjM",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+      "metadata": {
+        "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+      },
+      "outputs": [],
+      "source": [
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions\n",
+        "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Artifacts are additional data elements created by data transformations. Examples of artifacts are the `minimum` and `maximum` values from a `ScaleTo01` transformation, or the `mean` and `variance` from a `ScaleToZScore` transformation. For more details on artifacts, see https://beam.apache.org/documentation/ml/preprocess-data/#artifacts.\n",
+        "\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "90nXXc_A4Bmf"
+      },
+      "id": "90nXXc_A4Bmf"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+      "metadata": {
+        "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+      },
+      "outputs": [],
+      "source": [
+        "# Store artifacts generated by MLTransform.\n",
+        "# Each MLTransform instance requires an artifact location to be empty.\n",
+        "# We use this method to delete and refresh the artifact location for each example.\n",
+        "artifact_location = './my_artifacts'\n",
+        "def delete_artifact_location(artifact_location):\n",
+        "  import shutil\n",
+        "  import os\n",
+        "  if os.path.exists(artifact_location):\n",
+        "      shutil.rmtree(artifact_location)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+      "metadata": {
+        "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+      },
+      "source": [
+        "## Compute and map the vocabulary\n",
+        "\n",
+        "\n",
+        "`ComputeAndApplyVocabulary` is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. It facilitates transforming textual data into numerical representations for machine learning tasks.\n",
+        "\n",
+        "Let's use `ComputeAndApplyVocabulary` with `MLTransform`\n",
+        "\n",
+        "\n",
+        "\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+        "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+            "WARNING:absl:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+            "WARNING:absl:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+            "WARNING:absl:You are outputting instance dicts from `TransformDataset` which will not provide optimal performance. Consider setting  `output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow RecordBatch). Encoding functionality in this module works with both formats.\n",
+            "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+          ]
+        },

Review Comment:
   I will remove them. Thanks
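
One way the stderr warnings quoted above could be removed is to clear the stored outputs of every code cell in the notebook JSON. The following is only a minimal sketch using the Python standard library, not the change actually made in this PR; the `strip_outputs` helper and the `sample` notebook are hypothetical and exist for illustration only.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with all code-cell outputs cleared.

    Hypothetical helper for illustration; it assumes the usual notebook
    layout with a top-level "cells" list, as in the diff above.
    """
    # Deep-copy through JSON so the caller's dict is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny stand-in notebook with one markdown cell and one code cell
# that carries a stored stderr warning.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# MLTransform\n"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "source": ["print('hello')"],
            "outputs": [
                {"output_type": "stream", "name": "stderr", "text": ["WARNING: ...\n"]}
            ],
        },
    ]
}

cleaned = strip_outputs(sample)
```

To apply this to a file, the notebook would be read with `json.load`, passed through `strip_outputs`, and written back with `json.dump`.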


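The notebook text quoted in the diff describes `ComputeAndApplyVocabulary` as computing a vocabulary and then mapping each token to an integer index. The plain-Python sketch below illustrates only that idea. It is not the Beam or `tensorflow_transform` implementation; the function names, the tie-breaking order, and the `-1` default for unseen tokens are assumptions made for this example.

```python
from collections import Counter


def compute_vocabulary(rows):
    """Build a token -> index mapping from tokenized rows.

    Illustrative ordering: most frequent token first, ties broken
    alphabetically. The real transform may order tokens differently.
    """
    counts = Counter(token for row in rows for token in row)
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(rows, vocab, default=-1):
    """Replace each token with its index; unseen tokens get `default`."""
    return [[vocab.get(token, default) for token in row] for row in rows]


rows = [["I", "love", "Beam"], ["Beam", "is", "fun"]]
vocab = compute_vocabulary(rows)        # plays the role of the artifact
encoded = apply_vocabulary(rows, vocab)
print(vocab)
print(encoded)
```

In this sketch the `vocab` mapping corresponds to what the notebook calls an artifact: it is computed once from the data and can be reused to encode new rows in the same way.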

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
