riteshghorse commented on code in PR #27402:
URL: https://github.com/apache/beam/pull/27402#discussion_r1321898989
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the
[module documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details about artifacts, see the
[artifacts documentation](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an artifact location to be
empty.\n",
+ "# We use this method to delete and refresh the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
Review Comment:
Add link to transforms if possible, could be helpful for users to quickly
take a look
```suggestion
"[`ComputeAndApplyVocabulary`]() is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
```
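To make the reviewed paragraph concrete, here is a plain-Python sketch of the idea behind `ComputeAndApplyVocabulary`: compute a frequency-ordered vocabulary over the whole dataset, then map each token to its integer index. This is a conceptual illustration only, not the Beam/TFT implementation, and the exact index ordering TFT produces may differ.

```python
# Conceptual sketch of ComputeAndApplyVocabulary (plain Python,
# NOT the Beam/TFT implementation): build a frequency-ordered
# vocabulary from all documents, then replace tokens with indices.
from collections import Counter

def compute_vocabulary(docs):
    """Return tokens ordered by descending frequency across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return [token for token, _ in counts.most_common()]

def apply_vocabulary(doc, vocab):
    """Map each token in a document to its integer index in the vocabulary."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index[token] for token in doc]

docs = [
    ['I', 'love', 'pie'],
    ['I', 'love', 'going', 'to', 'the', 'park'],
]
vocab = compute_vocabulary(docs)
indexed = [apply_vocabulary(doc, vocab) for doc in docs]
```

The key property, which the Beam transform shares, is that the vocabulary is computed over the full dataset, so the same token maps to the same index in every document.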
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the
[module documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details about artifacts, see the
[artifacts documentation](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an artifact location to be
empty.\n",
+ "# We use this method to delete and refresh the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "delete_artifact_location(artifact_location)\n",
+ "\n",
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | 'CreateData' >> beam.Create(data)\n",
+ " | 'MLTransform' >>
MLTransform(write_artifact_location=artifact_location).with_transform(ComputeAndApplyVocabulary(columns=['x']))\n",
+ " | 'PrintResults' >> beam.Map(print)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d",
+ "metadata": {
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d"
+ },
+ "source": [
+ "### Fetch vocabulary artifacts\n",
+ "\n",
+ "This example generates a file with all the vocabulary in the dataset,
referred to in `MLTransform` as an artifact. To fetch artifacts generated by
the `ComputeAndApplyVocabulary` transform, use the `ArtifactsFetcher` class.
This class fetches both a vocabulary list and a path to the vocabulary file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "outputId": "cd8b6cf3-6093-4b1b-a063-ff327c090a92"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['love', 'I', 'to', 'the', 'pie', 'park', 'going']\n",
+ "./my_artifacts/transform_fn/assets/compute_and_apply_vocab\n",
+ "7\n"
+ ]
+ }
+ ],
+ "source": [
+ "fetcher = ArtifactsFetcher(artifact_location=artifact_location)\n",
+ "# get vocab list\n",
+ "vocab_list = fetcher.get_vocab_list()\n",
+ "print(vocab_list)\n",
+ "# get vocab file path\n",
+ "vocab_file_path = fetcher.get_vocab_filepath()\n",
+ "print(vocab_file_path)\n",
+ "# get vocab size\n",
+ "vocab_size = fetcher.get_vocab_size()\n",
+ "print(vocab_size)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d",
+ "metadata": {
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d"
+ },
+ "source": [
+ "## TFIDF\n",
+ "\n",
+ "TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical
statistic used in text processing to reflect how important a word is to a
document in a collection or corpus. It balances the frequency of a word in a
document against its frequency in the entire corpus, giving higher value to
more specific terms.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85",
+ "metadata": {
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85"
+ },
+ "outputs": [],
+ "source": [
+ "from apache_beam.ml.transforms.tft import TFIDF"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "outputId": "e87409ed-5e33-43fa-d3b6-a0c012636cef"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]), x_tfidf_weight=array([0.33333334,
0.33333334, 0.4684884 ], dtype=float32), x_vocab_index=array([0, 1, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]),
x_tfidf_weight=array([0.16666667, 0.16666667, 0.2342442 , 0.2342442 , 0.2342442
,\n",
+ " 0.2342442 ], dtype=float32), x_vocab_index=array([0, 1, 2,
3, 5, 6]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "delete_artifact_location(artifact_location)\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | beam.Create(data)\n",
+ " | MLTransform(write_artifact_location=artifact_location\n",
+ "
).with_transform(ComputeAndApplyVocabulary(columns=['x'])\n",
+ " ).with_transform(TFIDF(columns=['x']))\n",
+ " )\n",
+ " _ = data | beam.Map(print)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf",
+ "metadata": {
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf"
+ },
+ "source": [
+ "TFIDF produces two outputs, which appear as columns in the
output data. One column has the suffix `tfidf_weight`, and the other
column has the suffix `vocab_index`.\n",
+ "\n",
+ "- `vocab_index`: indices of the words computed in the
`ComputeAndApplyVocabulary` transform.\n",
+ "- `tfidf_weight`: the weight for each vocabulary index. The weight
represents how important the word at that `vocab_index` is to the
document.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db",
+ "metadata": {
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db"
+ },
+ "source": [
+ "## Scale the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4",
+ "metadata": {
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4"
+ },
+ "source": [
+ "### Scale the data between 0 and 1\n",
+ "\n",
+ "Scale the data so that it's in the range of 0 to 1. To scale the
data, the transform computes the minimum and maximum values over the whole dataset,
and then performs the following calculation:\n",
+ "\n",
+ "`x = (x - x_min) / (x_max - x_min)`\n",
+ "\n",
+ "To scale the data, use the `ScaleTo01` data processing transform in
`MLTransform`."
Review Comment:
consider adding link to transform doc
```suggestion
"To scale the data, use the [`ScaleTo01`]() data processing
transform in `MLTransform`."
```
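The `ScaleTo01` section this comment reviews is the standard min-max formula, `x' = (x - x_min) / (x_max - x_min)`. A plain-Python sketch of that math (an illustration only, not the Beam transform, which computes the min and max across the full dataset in a distributed way):

```python
# Min-max scaling to [0, 1]: x' = (x - x_min) / (x_max - x_min),
# with min and max taken over the whole dataset.
def scale_to_01(values):
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # All values identical; map everything to 0.0
        # to avoid division by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]

scaled = scale_to_01([1, 2, 3, 5])  # [0.0, 0.25, 0.5, 1.0]
```

Note that the minimum always maps to 0.0 and the maximum to 1.0; in the Beam transform these dataset-wide min/max values are the artifacts written to the artifact location.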
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the
[module documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details about artifacts, see the
[artifacts documentation](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an artifact location to be
empty.\n",
+ "# We use this method to delete and refresh the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "delete_artifact_location(artifact_location)\n",
+ "\n",
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | 'CreateData' >> beam.Create(data)\n",
+ " | 'MLTransform' >>
MLTransform(write_artifact_location=artifact_location).with_transform(ComputeAndApplyVocabulary(columns=['x']))\n",
+ " | 'PrintResults' >> beam.Map(print)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d",
+ "metadata": {
+ "id": "1e133002-7229-459d-8e3c-b41f4d65e76d"
+ },
+ "source": [
+ "### Fetch vocabulary artifacts\n",
+ "\n",
+ "This example generates a file with all the vocabulary in the dataset,
referred to in `MLTransform` as an artifact. To fetch artifacts generated by
the `ComputeAndApplyVocabulary` transform, use the `ArtifactsFetcher` class.
This class fetches both a vocabulary list and a path to the vocabulary file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "9c5fe46a-c718-4a82-bad8-aa091c0b0538",
+ "outputId": "cd8b6cf3-6093-4b1b-a063-ff327c090a92"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['love', 'I', 'to', 'the', 'pie', 'park', 'going']\n",
+ "./my_artifacts/transform_fn/assets/compute_and_apply_vocab\n",
+ "7\n"
+ ]
+ }
+ ],
+ "source": [
+ "fetcher = ArtifactsFetcher(artifact_location=artifact_location)\n",
+ "# get vocab list\n",
+ "vocab_list = fetcher.get_vocab_list()\n",
+ "print(vocab_list)\n",
+ "# get vocab file path\n",
+ "vocab_file_path = fetcher.get_vocab_filepath()\n",
+ "print(vocab_file_path)\n",
+ "# get vocab size\n",
+ "vocab_size = fetcher.get_vocab_size()\n",
+ "print(vocab_size)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d",
+ "metadata": {
+ "id": "5f955f3d-3192-42f7-aa55-48249223418d"
+ },
+ "source": [
+ "## TFIDF\n",
+ "\n",
+ "TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical
statistic used in text processing to reflect how important a word is to a
document in a collection or corpus. It balances the frequency of a word in a
document against its frequency in the entire corpus, giving higher value to
more specific terms.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85",
+ "metadata": {
+ "id": "8a8cb94b-57eb-4c4c-aa4c-22cf3193ea85"
+ },
+ "outputs": [],
+ "source": [
+ "from apache_beam.ml.transforms.tft import TFIDF"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "970d7222-194e-460e-b698-a00f1fcafb95",
+ "outputId": "e87409ed-5e33-43fa-d3b6-a0c012636cef"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n",
+ "WARNING:absl:Analyzer
(tfidf/sum/temporary_analyzer_output/PlaceholderWithDefault:0) node's cache key
varies on repeated tracing. This warning is safe to ignore if you either
specify `name` for all analyzers or if the order in which they are invoked is
deterministic. If not, please file a bug with details.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([1, 0, 4]), x_tfidf_weight=array([0.33333334,
0.33333334, 0.4684884 ], dtype=float32), x_vocab_index=array([0, 1, 4]))\n",
+ "Row(x=array([1, 0, 6, 2, 3, 5]),
x_tfidf_weight=array([0.16666667, 0.16666667, 0.2342442 , 0.2342442 , 0.2342442
,\n",
+ " 0.2342442 ], dtype=float32), x_vocab_index=array([0, 1, 2,
3, 5, 6]))\n"
+ ]
+ }
+ ],
+ "source": [
+ "data = [\n",
+ " {'x': ['I', 'love', 'pie']},\n",
+ " {'x': ['I', 'love', 'going', 'to', 'the', 'park']}\n",
+ "]\n",
+ "delete_artifact_location(artifact_location)\n",
+ "options = PipelineOptions()\n",
+ "with beam.Pipeline(options=options) as p:\n",
+ " data = (\n",
+ " p\n",
+ " | beam.Create(data)\n",
+ " | MLTransform(write_artifact_location=artifact_location\n",
+ "
).with_transform(ComputeAndApplyVocabulary(columns=['x'])\n",
+ " ).with_transform(TFIDF(columns=['x']))\n",
+ " )\n",
+ " _ = data | beam.Map(print)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf",
+ "metadata": {
+ "id": "7b1feb4f-bb0b-4f61-8349-e1ba411858cf"
+ },
+ "source": [
+ "`TFIDF` produces two outputs, which appear as columns in the
output. One column has the suffix `tfidf_weight`, and the other
column has the suffix `vocab_index`.\n",
+ "\n",
+ "- `vocab_index`: indices of the words computed in the
`ComputeAndApplyVocabulary` transform.\n",
+ "- `tfidf_weight`: the weight for each vocabulary index. The weight
represents how important the word at that `vocab_index` is to the
document.\n"
+ ]
+ },
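As a side note for reviewers: the `x_tfidf_weight` values printed above can be reproduced by hand. A minimal sketch, assuming a smoothed IDF of `1 + ln((1 + N) / (1 + df))` (inferred from the printed output, not taken from the TFT source):

```python
import math

docs = [['I', 'love', 'pie'],
        ['I', 'love', 'going', 'to', 'the', 'park']]

def tfidf(term, doc, docs):
    # Term frequency: occurrences in this document / document length.
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing the term.
    df = sum(term in d for d in docs)
    # Smoothed IDF; this form matches the notebook's printed weights.
    idf = 1 + math.log((1 + len(docs)) / (1 + df))
    return tf * idf

# 'pie' occurs only in the first of two documents.
print(tfidf('pie', docs[0], docs))  # ~0.4684884, matching the Row output
# 'I' occurs in both documents, so its IDF factor is 1.
print(tfidf('I', docs[0], docs))    # ~0.33333334
```

This is only a reference computation; the actual weights are produced by the `TFIDF` transform in the pipeline above.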
+ {
+ "cell_type": "markdown",
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db",
+ "metadata": {
+ "id": "d3b5b9dd-ed35-460b-9fb3-0ffb5c3633db"
+ },
+ "source": [
+ "## Scale the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4",
+ "metadata": {
+ "id": "3bd20692-6d14-4ece-a2e7-69a2a6fac5d4"
+ },
+ "source": [
+ "### Scale the data between 0 and 1\n",
+ "\n",
+ "Scale the data so that it's in the range of 0 to 1. To scale the
data, the transform calculates the minimum and maximum values over the whole
dataset, and then performs the following calculation:\n",
+ "\n",
+ "`x_scaled = (x - x_min) / (x_max - x_min)`\n",
+ "\n",
+ "To scale the data, use the `ScaleTo01` data processing transform in
`MLTransform`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "841a8e1f-2f5b-4fd9-bb35-12a2393922de",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "841a8e1f-2f5b-4fd9-bb35-12a2393922de",
+ "outputId": "efcae38d-96f6-4394-e5f5-c36644d3a9ff"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Row(x=array([0. , 0.01010101, 0.02020202], dtype=float32),
x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))\n",
+ "Row(x=array([0.03030303, 0.04040404, 0.06060606], dtype=float32),
x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))\n",
+ "Row(x=array([0.09090909, 0.01010101, 0.09090909, 0.33333334, 1.
,\n",
+ " 0.53535354, 0.1919192 , 0.09090909, 0.01010101,
0.02020202,\n",
+ " 0.1010101 , 0.11111111], dtype=float32),
x_max=array([100.], dtype=float32), x_min=array([1.], dtype=float32))\n"
+ ]
+ }
+ ],
+ "source": [
+ "delete_artifact_location(artifact_location)\n",
+ "\n",
+ "from apache_beam.ml.transforms.tft import ScaleTo01\n",
+ "data = [\n",
+ " {'x': [1, 2, 3]}, {'x': [4, 5, 7]}, {'x': [10, 2, 10, 34, 100,
54, 20, 10, 2, 3, 11, 12]}]\n",
+ "\n",
+ "# delete_artifact_location(artifact_location)\n",
Review Comment:
we can remove the comment here I think
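Also, for anyone checking the `ScaleTo01` output above: the printed values follow from plain min-max scaling over the whole dataset. A quick reference sketch (not the TFT implementation):

```python
# The flattened dataset from the ScaleTo01 example above.
data = [1, 2, 3, 4, 5, 7, 10, 2, 10, 34, 100, 54, 20, 10, 2, 3, 11, 12]
x_min, x_max = min(data), max(data)  # 1 and 100, as in the printed Rows

def scale_to_01(x):
    # Min-max scaling computed over the whole dataset.
    return (x - x_min) / (x_max - x_min)

print(scale_to_01(2))    # 0.010101..., matching the second element above
print(scale_to_01(100))  # 1.0
```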
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the [module
documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install apache_beam>=2.50.0 --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details, see
[Artifacts](https://beam.apache.org/documentation/ml/preprocess-data/#artifacts).\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# Store artifacts generated by MLTransform.\n",
+ "# Each MLTransform instance requires an empty artifact location.\n",
+ "# This helper deletes and refreshes the artifact location for
each example.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
Review Comment:
consider removing irrelevant warnings like this manually
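One more reference point for this cell: `ComputeAndApplyVocabulary` builds a frequency-ordered vocabulary and maps each token to its index. A rough sketch of the idea (the ordering among equal-frequency tokens is not guaranteed to match TFT's):

```python
from collections import Counter

docs = [['I', 'love', 'pie'],
        ['I', 'love', 'going', 'to', 'the', 'park']]

# Count token frequencies across the whole dataset.
counts = Counter(token for doc in docs for token in doc)
# Vocabulary ordered by descending frequency.
vocab = [token for token, _ in counts.most_common()]
token_to_index = {token: i for i, token in enumerate(vocab)}

print(vocab)  # 'I' and 'love' (frequency 2) come before the rest
print(token_to_index['pie'])
```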
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]