AnandInguva commented on code in PR #27402:
URL: https://github.com/apache/beam/pull/27402#discussion_r1322073355
##########
examples/notebooks/beam-ml/mltransform_basic.ipynb:
##########
@@ -0,0 +1,733 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/mltransform_notebook/examples/notebooks/beam-ml/mltransform_basic.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "34gTXZ7BIArp"
+ },
+ "id": "34gTXZ7BIArp",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# MLTransform\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "0n0YAd-0KQyi"
+ },
+ "id": "0n0YAd-0KQyi"
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0",
+ "metadata": {
+ "id": "d3b81cf2-8603-42bd-995e-9e14631effd0"
+ },
+ "source": [
+ "This notebook demonstrates how to use `MLTransform` to preprocess
your data for machine learning models. `MLTransform` is a `PTransform` that
wraps multiple Apache Beam data processing transforms in one transform. As a
result, `MLTransform` gives you the ability to preprocess different types of
data in multiple ways with one transform.\n",
+ "\n",
+ "This notebook uses data processing transforms defined in the
[apache_beam/ml/transforms/tft](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html)
module. For a full list of available transforms, see the API reference:
https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01",
+ "metadata": {
+ "id": "f0097dbd-2657-4cbe-a334-e0401816db01"
+ },
+ "source": [
+ "## Import the required modules\n",
+ "\n",
+ "To use `MLTransform`, install `tensorflow_transform` and the Apache
Beam SDK version 2.50.0 or later.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install tensorflow_transform --quiet\n",
+ "!pip install 'apache_beam>=2.50.0' --quiet"
+ ],
+ "metadata": {
+ "id": "MRWkC-n2DmjM"
+ },
+ "id": "MRWkC-n2DmjM",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165",
+ "metadata": {
+ "id": "88ddd3a4-3643-4731-b99e-a5d697fbc165"
+ },
+ "outputs": [],
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.tft import
ComputeAndApplyVocabulary\n",
+ "from apache_beam.options.pipeline_options import PipelineOptions\n",
+ "from apache_beam.ml.transforms.utils import ArtifactsFetcher"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Artifacts are additional data elements created by data
transformations. Examples of artifacts are the `minimum` and `maximum` values
from a `ScaleTo01` transformation, or the `mean` and `variance` from a
`ScaleToZScore` transformation. For more details, see the artifacts
documentation: https://beam.apache.org/documentation/ml/preprocess-data/#artifacts\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "90nXXc_A4Bmf"
+ },
+ "id": "90nXXc_A4Bmf"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61",
+ "metadata": {
+ "id": "bdabbc57-ec98-4113-b37e-61962f488d61"
+ },
+ "outputs": [],
+ "source": [
+ "# Location where MLTransform stores the artifacts it generates.\n",
+ "# Each MLTransform instance requires an empty artifact
location.\n",
+ "# This helper deletes the artifact location so that each example
starts clean.\n",
+ "artifact_location = './my_artifacts'\n",
+ "def delete_artifact_location(artifact_location):\n",
+ " import shutil\n",
+ " import os\n",
+ " if os.path.exists(artifact_location):\n",
+ " shutil.rmtree(artifact_location)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef",
+ "metadata": {
+ "id": "28b1719c-7287-4cec-870b-9fabc4c4a4ef"
+ },
+ "source": [
+ "## Compute and map the vocabulary\n",
+ "\n",
+ "\n",
+ "`ComputeAndApplyVocabulary` is a data processing transform that
computes a unique vocabulary from a dataset and then maps each word or token to
a distinct integer index. It facilitates transforming textual data into
numerical representations for machine learning tasks.\n",
+ "\n",
+ "Let's use `ComputeAndApplyVocabulary` with `MLTransform`.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "56d6d09a-8d34-444f-a1e4-a75624b36932",
+ "outputId": "2eb99e87-fb23-498c-ed08-775befa3a823"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are passing instance dicts and DatasetMetadata
to TFT which will not provide optimal performance. Consider following the TFT
guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n",
+ "WARNING:absl:You are outputting instance dicts from
`TransformDataset` which will not provide optimal performance. Consider setting
`output_record_batches=True` to upgrade to the TFXIO format (Apache Arrow
RecordBatch). Encoding functionality in this module works with both formats.\n",
+ "WARNING:apache_beam.options.pipeline_options:Discarding
unparseable args: ['-f',
'/root/.local/share/jupyter/runtime/kernel-eb509d9c-cd3a-4a27-ab40-19bbfa38a5ad.json']\n"
+ ]
+ },
Review Comment:
I will remove them. Thanks
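
For reference, a minimal sketch of one way to clear stored outputs such as the stderr warnings above from a notebook before committing it. It is not part of this PR. The helper name `strip_outputs` and the sample notebook dict are hypothetical, and the sketch assumes the notebook is handled as a plain dict in the usual `.ipynb` JSON layout (`cells`, `cell_type`, `outputs`, `execution_count`, and the Colab `outputId` metadata key seen in the diff).

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared.

    Hypothetical helper for illustration; the input is left unmodified.
    """
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        # Drop stored outputs (for example, stderr warning streams).
        cell["outputs"] = []
        cell["execution_count"] = None
        # Colab stores an outputId in the cell metadata; remove it as well.
        cell.get("metadata", {}).pop("outputId", None)
    return cleaned


# Made-up sample data mirroring the cell layout in the diff.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# MLTransform\n"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {"id": "abc", "outputId": "xyz"},
            "outputs": [
                {"output_type": "stream", "name": "stderr", "text": ["WARNING: ...\n"]}
            ],
            "source": ["print('hello')\n"],
        },
    ]
}

cleaned = strip_outputs(example)
print(cleaned["cells"][1]["outputs"])  # prints []
```

To apply this to a file, the dict could be loaded with `json.load` and written back with `json.dump`. Clearing outputs from the Colab or Jupyter menu before saving achieves the same result.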
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.