riteshghorse commented on code in PR #27402:
URL: https://github.com/apache/beam/pull/27402#discussion_r1321898989
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the
[module documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details about artifacts, see the
[artifacts documentation](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an artifact location to be
empty.\n",
+ "# We use this method to delete and refresh the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
Review Comment:
Add link to transforms if possible, could be helpful for users to quickly
take a look
```suggestion
"[`ComputeAndApplyVocabulary`]() is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
```
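To make the reviewed paragraph concrete, here is a plain-Python sketch of the idea behind `ComputeAndApplyVocabulary`: compute a frequency-ordered vocabulary over the whole dataset, then map each token to its integer index. This is a conceptual illustration only, not the Beam/TFT implementation, and the exact index ordering TFT produces may differ.

```python
# Conceptual sketch of ComputeAndApplyVocabulary (plain Python,
# NOT the Beam/TFT implementation): build a frequency-ordered
# vocabulary from all documents, then replace tokens with indices.
from collections import Counter

def compute_vocabulary(docs):
    """Return tokens ordered by descending frequency across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return [token for token, _ in counts.most_common()]

def apply_vocabulary(doc, vocab):
    """Map each token in a document to its integer index in the vocabulary."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index[token] for token in doc]

docs = [
    ['I', 'love', 'pie'],
    ['I', 'love', 'going', 'to', 'the', 'park'],
]
vocab = compute_vocabulary(docs)
indexed = [apply_vocabulary(doc, vocab) for doc in docs]
```

The key property, which the Beam transform shares, is that the vocabulary is computed over the full dataset, so the same token maps to the same index in every document.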
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the
[module documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details about artifacts, see the
[artifacts documentation](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an artifact location to be
empty.\n",
+ "# We use this method to delete and refresh the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "delete_artifact_location(artifact_location)\n",
+ "\n",
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | 'CreateData' >> beam.Create(data)\n",
+ " | 'MLTransform' >>
MLTransform(write_artifact_location=artifact_location).with_transform(ComputeAndApplyVocabulary(columns=['x']))\n",
+ " | 'PrintResults' >> beam.Map(print)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d",
+ "metadata": {
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d"
+ },
+ "source": [
+ "### Fetch vocabulary artifacts\n",
+ "\n",
+ "This example generates a file with all the vocabulary in the dataset,
referred to in `MLTransform` as an artifact. To fetch artifacts generated by
the `ComputeAndApplyVocabulary` transform, use the `ArtifactsFetcher` class.
This class fetches both a vocabulary list and a path to the vocabulary file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "outputId": "cd8b6cf3-6093-4b1b-a063-ff327c090a92"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['love', 'I', 'to', 'the', 'pie', 'park', 'going']\n",
+ "./my_artifacts/transform_fn/assets/compute_and_apply_vocab\n",
+ "7\n"
+ ]
+ }
+ ],
+ "source": [
+ "fetcher = ArtifactsFetcher(artifact_location=artifact_location)\n",
+ "# get vocab list\n",
+ "vocab_list = fetcher.get_vocab_list()\n",
+ "print(vocab_list)\n",
+ "# get vocab file path\n",
+ "vocab_file_path = fetcher.get_vocab_filepath()\n",
+ "print(vocab_file_path)\n",
+ "# get vocab size\n",
+ "vocab_size = fetcher.get_vocab_size()\n",
+ "print(vocab_size)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d",
+ "metadata": {
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d"
+ },
+ "source": [
+ "## TFIDF\n",
+ "\n",
+ "TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical
statistic used in text processing to reflect how important a word is to a
document in a collection or corpus. It balances the frequency of a word in a
document against its frequency in the entire corpus, giving higher value to
more specific terms.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85",
+ "metadata": {
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85"
+ },
+ "outputs": [],
+ "source": [
+ "from apache_beam.ml.transforms.tft import TFIDF"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "outputId": "e87409ed-5e33-43fa-d3b6-a0c012636cef"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]), x_tfidf_weight=array([0.33333334,
0.33333334, 0.4684884 ], dtype=float32), x_vocab_index=array([0, 1, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]),
x_tfidf_weight=array([0.16666667, 0.16666667, 0.2342442 , 0.2342442 , 0.2342442
,\n",
+ " 0.2342442 ], dtype=float32), x_vocab_index=array([0, 1, 2,
3, 5, 6]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "delete_artifact_location(artifact_location)\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | beam.Create(data)\n",
+ " | MLTransform(write_artifact_location=artifact_location\n",
+ "
).with_transform(ComputeAndApplyVocabulary(columns=['x'])\n",
+ " ).with_transform(TFIDF(columns=['x']))\n",
+ " )\n",
+ " _ = data | beam.Map(print)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf",
+ "metadata": {
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf"
+ },
+ "source": [
+ "TFIDF produces two outputs, which appear as columns in the
output data. One column has the suffix `tfidf_weight`, and the other
column has the suffix `vocab_index`.\n",
+ "\n",
+ "- `vocab_index`: indices of the words computed in the
`ComputeAndApplyVocabulary` transform.\n",
+ "- `tfidf_weight`: the weight for each vocabulary index. The weight
represents how important the word at that `vocab_index` is to the
document.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db",
+ "metadata": {
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db"
+ },
+ "source": [
+ "## Scale the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4",
+ "metadata": {
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4"
+ },
+ "source": [
+ "### Scale the data between 0 and 1\n",
+ "\n",
+ "Scale the data so that it's in the range of 0 to 1. To scale the
data, the transform computes the minimum and maximum values over the whole dataset,
and then performs the following calculation:\n",
+ "\n",
+ "`x = (x - x_min) / (x_max - x_min)`\n",
+ "\n",
+ "To scale the data, use the `ScaleTo01` data processing transform in
`MLTransform`."
Review Comment:
consider adding link to transform doc
```suggestion
"To scale the data, use the [`ScaleTo01`]() data processing
transform in `MLTransform`."
```
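The `ScaleTo01` section this comment reviews is the standard min-max formula, `x' = (x - x_min) / (x_max - x_min)`. A plain-Python sketch of that math (an illustration only, not the Beam transform, which computes the min and max across the full dataset in a distributed way):

```python
# Min-max scaling to [0, 1]: x' = (x - x_min) / (x_max - x_min),
# with min and max taken over the whole dataset.
def scale_to_01(values):
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # All values identical; map everything to 0.0
        # to avoid division by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]

scaled = scale_to_01([1, 2, 3, 5])  # [0.0, 0.25, 0.5, 1.0]
```

Note that the minimum always maps to 0.0 and the maximum to 1.0; in the Beam transform these dataset-wide min/max values are the artifacts written to the artifact location.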
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the
[module documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details about artifacts, see the
[artifacts documentation](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an artifact location to be
empty.\n",
+ "# We use this method to delete and refresh the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "delete_artifact_location(artifact_location)\n",
+ "\n",
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | 'CreateData' >> beam.Create(data)\n",
+ " | 'MLTransform' >>
MLTransform(write_artifact_location=artifact_location).with_transform(ComputeAndApplyVocabulary(columns=['x']))\n",
+ " | 'PrintResults' >> beam.Map(print)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d",
+ "metadata": {
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d"
+ },
+ "source": [
+ "### Fetch vocabulary artifacts\n",
+ "\n",
+ "This example generates a file with all the vocabulary in the dataset,
referred to in `MLTransform` as an artifact. To fetch artifacts generated by
the `ComputeAndApplyVocabulary` transform, use the `ArtifactsFetcher` class.
This class fetches both a vocabulary list and a path to the vocabulary file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "outputId": "cd8b6cf3-6093-4b1b-a063-ff327c090a92"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['love', 'I', 'to', 'the', 'pie', 'park', 'going']\n",
+ "./my_artifacts/transform_fn/assets/compute_and_apply_vocab\n",
+ "7\n"
+ ]
+ }
+ ],
+ "source": [
+ "fetcher = ArtifactsFetcher(artifact_location=artifact_location)\n",
+ "# get vocab list\n",
+ "vocab_list = fetcher.get_vocab_list()\n",
+ "print(vocab_list)\n",
+ "# get vocab file path\n",
+ "vocab_file_path = fetcher.get_vocab_filepath()\n",
+ "print(vocab_file_path)\n",
+ "# get vocab size\n",
+ "vocab_size = fetcher.get_vocab_size()\n",
+ "print(vocab_size)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d",
+ "metadata": {
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d"
+ },
+ "source": [
+ "## TFIDF\n",
+ "\n",
+ "TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical
statistic used in text processing to reflect how important a word is to a
document in a collection or corpus. It balances the frequency of a word in a
document against its frequency in the entire corpus, giving higher value to
more specific terms.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85",
+ "metadata": {
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85"
+ },
+ "outputs": [],
+ "source": [
+ "from apache_beam.ml.transforms.tft import TFIDF"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "outputId": "e87409ed-5e33-43fa-d3b6-a0c012636cef"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]), x_tfidf_weight=array([0.33333334,
0.33333334, 0.4684884 ], dtype=float32), x_vocab_index=array([0, 1, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]),
x_tfidf_weight=array([0.16666667, 0.16666667, 0.2342442 , 0.2342442 , 0.2342442
,\n",
+ " 0.2342442 ], dtype=float32), x_vocab_index=array([0, 1, 2,
3, 5, 6]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "delete_artifact_location(artifact_location)\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | beam.Create(data)\n",
+ " | MLTransform(write_artifact_location=artifact_location\n",
+ "
).with_transform(ComputeAndApplyVocabulary(columns=['x'])\n",
+ " ).with_transform(TFIDF(columns=['x']))\n",
+ " )\n",
+ " _ = data | beam.Map(print)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf",
+ "metadata": {
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf"
+ },
+ "source": [
+ "`TFIDF` produces two outputs, which appear as columns in the
output. One column has the suffix `tfidf_weight`, and the other
column has the suffix `vocab_index`.\n",
+ "\n",
+ "- `vocab_index`: indices of the words computed in the
`ComputeAndApplyVocabulary` transform.\n",
+ "- `tfidf_weight`: the weight for each vocabulary index. The weight
represents how important the word at that `vocab_index` is to the
document.\n"
+ ]
+ },
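As a side note for reviewers: the `x_tfidf_weight` values printed above can be reproduced by hand. A minimal sketch, assuming a smoothed IDF of `1 + ln((1 + N) / (1 + df))` (inferred from the printed output, not taken from the TFT source):

```python
import math

docs = [['I', 'love', 'pie'],
        ['I', 'love', 'going', 'to', 'the', 'park']]

def tfidf(term, doc, docs):
    # Term frequency: occurrences in this document / document length.
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing the term.
    df = sum(term in d for d in docs)
    # Smoothed IDF; this form matches the notebook's printed weights.
    idf = 1 + math.log((1 + len(docs)) / (1 + df))
    return tf * idf

# 'pie' occurs only in the first of two documents.
print(tfidf('pie', docs[0], docs))  # ~0.4684884, matching the Row output
# 'I' occurs in both documents, so its IDF factor is 1.
print(tfidf('I', docs[0], docs))    # ~0.33333334
```

This is only a reference computation; the actual weights are produced by the `TFIDF` transform in the pipeline above.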
+ {
+ "cell_type": "markdown",
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db",
+ "metadata": {
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db"
+ },
+ "source": [
+ "## Scale the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4",
+ "metadata": {
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4"
+ },
+ "source": [
+ "### Scale the data between 0 and 1\n",
+ "\n",
+ "Scale the data so that it's in the range of 0 to 1. To scale the
data, the transform calculates the minimum and maximum values over the whole
dataset, and then performs the following calculation:\n",
+ "\n",
+ "`x_scaled = (x - x_min) / (x_max - x_min)`\n",
+ "\n",
+ "To scale the data, use the `ScaleTo01` data processing transform in
`MLTransform`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "841a8e1f-2f5b-4fd9-bb35-12a2393922de",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "841a8e1f-2f5b-4fd9-bb35-12a2393922de",
+ "outputId": "efcae38d-96f6-4394-e5f5-c36644d3a9ff"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([0. , 0.01010101, 0.02020202], dtype=float32),
x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))\n",
+ "Row(x=array([0.03030303, 0.04040404, 0.06060606], dtype=float32),
x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))\n",
+ "Row(x=array([0.09090909, 0.01010101, 0.09090909, 0.33333334, 1.
,\n",
+ " 0.53535354, 0.1919192 , 0.09090909, 0.01010101,
0.02020202,\n",
+ " 0.1010101 , 0.11111111], dtype=float32),
x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))\n"
+ ]
+ }
+ ],
+ "source": [
+ "delete_artifact_location(artifact_location)\n",
+ "\n",
+ "from apache_beam.ml.transforms.tft import ScaleTo01\n",
+ "data = [\n",
+ " {'x': [1, 2, 3]}, {'x': [4, 5, 7]}, {'x': [10, 2, 10, 34, 100,
54, 20, 10, 2, 3, 11, 12]}]\n",
+ "\n",
+ "# delete_artifact_location(artifact_location)\n",
Review Comment:
we can remove the comment here I think
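Also, for anyone checking the `ScaleTo01` output above: the printed values follow from plain min-max scaling over the whole dataset. A quick reference sketch (not the TFT implementation):

```python
# The flattened dataset from the ScaleTo01 example above.
data = [1, 2, 3, 4, 5, 7, 10, 2, 10, 34, 100, 54, 20, 10, 2, 3, 11, 12]
x_min, x_max = min(data), max(data)  # 1 and 100, as in the printed Rows

def scale_to_01(x):
    # Min-max scaling computed over the whole dataset.
    return (x - x_min) / (x_max - x_min)

print(scale_to_01(2))    # 0.010101..., matching the second element above
print(scale_to_01(100))  # 1.0
```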
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the [module
documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install apache_beam>=2.50.0 --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details, see
[Artifacts](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# Store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an empty artifact location.\n",
+ "# This helper deletes and refreshes the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
Review Comment:
consider removing irrelevant warnings like this manually
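One more reference point for this cell: `ComputeAndApplyVocabulary` builds a frequency-ordered vocabulary and maps each token to its index. A rough sketch of the idea (the ordering among equal-frequency tokens is not guaranteed to match TFT's):

```python
from collections import Counter

docs = [['I', 'love', 'pie'],
        ['I', 'love', 'going', 'to', 'the', 'park']]

# Count token frequencies across the whole dataset.
counts = Counter(token for doc in docs for token in doc)
# Vocabulary ordered by descending frequency.
vocab = [token for token, _ in counts.most_common()]
token_to_index = {token: i for i, token in enumerate(vocab)}

print(vocab)  # 'I' and 'love' (frequency 2) come before the rest
print(token_to_index['pie'])
```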
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]