Re: [PR] Add notebooks for text embeddings [beam]

via GitHub Fri, 05 Jan 2024 10:03:47 -0800


rszper commented on code in PR #29893:
URL: https://github.com/apache/beam/pull/29893#discussion_r1443188053



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,404 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\";
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\";
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",

Review Comment:
   ```suggestion
           "## Text embeddings\n",
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,404 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",

Review Comment:
   ```suggestion
           "# Generate text embeddings by using Hugging Face Hub models\n",
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb:
##########
@@ -0,0 +1,316 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "muiqKarukWj0"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate text embeddings by using the Vertex AI API\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\";
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\";
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Text embeddings are a way to represent text as numerical vectors. 
This process lets computers understand and process text data, which is 
essential for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "## Uses of text embeddings\n",
+        "The following NLP tasks use embeddings:\n",
+        "\n",
+        "* **Semantic search:** Find documents or passages that are relevant 
to a query when the query doesn't use the exact same words as the documents.\n",
+        "* **Text classification:** Categorize text data into different 
classes, such as spam and not spam, or positive sentiment and negative 
sentiment.\n",
+        "* **Machine translation:** Translate text from one language to 
another and preserve the meaning.\n",
+        "* **Text summarization:** Create shorter summaries of text.\n",
+        "\n",
+        "This notebook generates embeddings from text data by using Apache 
Beam's `MLTransform` with the `Vertex AI` Python SDK.\n",

Review Comment:
   ```suggestion
           "This notebook generates embeddings from text data by using Apache 
Beam's `MLTransform` with the Vertex AI Python SDK.\n",
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,404 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\";
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\";
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Use text embeddings to represent text as numerical vectors. This 
process lets computers understand and process text data, which is essential for 
many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "The following NLP tasks use embeddings:\n",
+        "\n",
+        "* **Semantic search:** Find documents or passages that are relevant 
to a query when the query doesn't use the exact same words as the documents.\n",
+        "* **Text classification:** Categorize text data into different 
classes, such as spam and not spam, or positive sentiment and negative 
sentiment.\n",
+        "* **Machine translation:** Translate text from one language to 
another and preserve the meaning.\n",
+        "* **Text summarization:** Create shorter summaries of text.\n",
+        "\n",
+        "This notebook uses Apache Beam's `MLTransform` to generate embeddings 
from text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n"

Review Comment:
   ```suggestion
           "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use the `SentenceTransformerEmbeddings` module to specify the 
model configuration.\n"
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb:
##########
@@ -0,0 +1,316 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "muiqKarukWj0"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate text embeddings by using the Vertex AI API\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\";
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\";
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Text embeddings are a way to represent text as numerical vectors. 
This process lets computers understand and process text data, which is 
essential for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "## Uses of text embeddings\n",
+        "The following NLP tasks use embeddings:\n",
+        "\n",
+        "* **Semantic search:** Find documents or passages that are relevant 
to a query when the query doesn't use the exact same words as the documents.\n",
+        "* **Text classification:** Categorize text data into different 
classes, such as spam and not spam, or positive sentiment and negative 
sentiment.\n",
+        "* **Machine translation:** Translate text from one language to 
another and preserve the meaning.\n",
+        "* **Text summarization:** Create shorter summaries of text.\n",
+        "\n",
+        "This notebook generates embeddings from text data by using Apache 
Beam's `MLTransform` with the `Vertex AI` Python SDK.\n",
+        "\n",
+        "Use the Vertex AI text-embeddings API to generate text embeddings 
that use Google’s large generative artificial intelligence (AI) models. To 
generate text embeddings by using the Vertex AI text-embeddings API, use 
`MLTransform` with the `VertexAITextEmbeddings` class to specify the model 
configuration. For more information, see [Get text 
embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).
 \n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation.\n",
+        "\n",
+        "## Requirements\n",
+        "\n",
+        "To use the Vertex AI text-embeddings API, complete the following 
prerequisites:\n",
+        "\n",
+        "* Install the `google-cloud-aiplatform` Python package.\n",
+        "* Do one of the following tasks:\n",
+        "  * Configure credentials for your Google Cloud project. For more 
information, see [Google Auth Library for 
Python](https://googleapis.dev/python/google-auth/latest/reference/google.auth.html#module-google.auth).\n",
+        "  * Store the path to a service account JSON file by using the 
[GOOGLE_APPLICATION_CREDENTIALS](https://cloud.google.com/docs/authentication/application-default-credentials#GAC)
 environment variable."
+      ],
+      "metadata": {
+        "id": "bkpSCGCWlqAf"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To use your Google Cloud account, authenticate this notebook."
+      ],
+      "metadata": {
+        "id": "W29FgO5Qv2ew"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from google.colab import auth\n",
+        "auth.authenticate_user()\n",
+        "\n",
+        "# TODO: Remove the project name before merging.\n",
+        "project = 'google.com:clouddfe' # Replace with a valid project id."
+      ],
+      "metadata": {
+        "id": "nYyyGYt3licq"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Install dependencies\n",
+        " Install Apache Beam and the dependencies required for the Vertex AI 
text-embeddings API."
+      ],
+      "metadata": {
+        "id": "UQROd16ZDN5y"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n";,
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python[gcp]"
+      ],
+      "metadata": {
+        "id": "BTxob7d5DLBM"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.vertex_ai import 
VertexAITextEmbeddings"
+      ],
+      "metadata": {
+        "id": "SkMhR7H6n1P0"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Transform the data\n",
+        "\n",
+        "`MLTransform` is a `PTransform` that you can use for data 
preparation, including generating text embeddings.\n",
+        "\n",
+        "### Use MLTransform in write mode\n",
+        "\n",
+         "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. Then, when you run `MLTransform` in `read` 
mode, these transforms are used. This process ensures that you're applying the 
same preprocessing steps when you train your model and when you serve the model 
in production or test its accuracy.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "cokOaX2kzyke"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Get the data\n",
+        "\n",
+        "`MLTransform` processes dictionaries that include column names and 
their associated text data. To generate embeddings for specific columns, 
specify these column names in the `columns` argument of 
`VertexAITextEmbeddings`. This transform uses the the Vertex AI text-embeddings 
API for online predictions to generate an embeddings vector for each sentence"

Review Comment:
   ```suggestion
           "`MLTransform` processes dictionaries that include column names and 
their associated text data. To generate embeddings for specific columns, 
specify these column names in the `columns` argument of 
`VertexAITextEmbeddings`. This transform uses the the Vertex AI text-embeddings 
API for online predictions to generate an embeddings vector for each sentence."
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,404 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\";
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\";><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\";
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Use text embeddings to represent text as numerical vectors. This 
process lets computers understand and process text data, which is essential for 
many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "The following NLP tasks use embeddings:\n",
+        "\n",
+        "* **Semantic search:** Find documents or passages that are relevant 
to a query when the query doesn't use the exact same words as the documents.\n",
+        "* **Text classification:** Categorize text data into different 
classes, such as spam and not spam, or positive sentiment and negative 
sentiment.\n",
+        "* **Machine translation:** Translate text from one language to 
another and preserve the meaning.\n",
+        "* **Text summarization:** Create shorter summaries of text.\n",
+        "\n",
+        "This notebook uses Apache Beam's `MLTransform` to generate embeddings 
from text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n"
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Install dependencies\n",
+        "\n",
+        "Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings. The dependencies includes the `sentence-transformers` package, 
which is required to use the `SentenceTransformerEmbeddings` module.\n"
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n";,
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": 28,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 29,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Process the data\n",
+        "\n",
+        "`MLTransform` is a `PTransform` that you can use for data 
preparation, including generating text embeddings.\n",
+        "\n",
+        "### Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. Then, when you run `MLTransform` in `read` 
mode, these transforms are used. This process ensures that you're applying the 
same preprocessing steps when you train your model and when you serve the model 
in production or test its accuracy.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Get the data\n",
+        "\n",
+        "The following text inputs come from the Hugging Face blog [Getting 
Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings).\n",
+        "\n",
+        "\n",
+        "`MLTransform` operates on dictionaries of data. To generate 
embeddings for specific columns, provide the column names as input to the 
`columns` argument in the `SentenceTransformerEmbeddings` package.\""
+      ],
+      "metadata": {
+        "id": "Dbkmu3HP6Kql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "content = [\n",
+        "    {'x': 'How do I get a replacement Medicare card?'},\n",
+        "    {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+        "    {'x': 'How do I terminate my Medicare Part B (medical 
insurance)?'},\n",
+        "    {'x': 'How do I sign up for Medicare?'},\n",
+        "    {'x': 'Can I sign up for Medicare Part B if I am working and have 
health insurance through an employer?'},\n",
+        "    {'x': 'How do I sign up for Medicare Part B if I already have 
Part A?'},\n",
+        "    {'x': 'What are Medicare late enrollment penalties?'},\n",
+        "    {'x': 'What is Medicare and who can get it?'},\n",
+        "    {'x': 'How can I get help with my Medicare Part A and Part B 
premiums?'},\n",
+        "    {'x': 'What are the different parts of Medicare?'},\n",
+        "    {'x': 'Will my Medicare premiums be higher because of my higher 
income?'},\n",
+        "    {'x': 'What is TRICARE ?'},\n",
+        "    {'x': \"Should I sign up for Medicare Part B if I have Veterans' 
Benefits?\"}\n",
+        "]\n",
+        "\n",
+        "text_embedding_model_name = 
'sentence-transformers/sentence-t5-large'\n",
+        "\n",
+        "\n",
+        "# helper function that returns a dict containing only first\n",
+        "# ten elements of generated embeddings\n",
+        "def truncate_embeddings(d):\n",
+        "  for key in d.keys():\n",
+        "    d[key] = d[key][:10]\n",
+        "  return d"
+      ],
+      "metadata": {
+        "id": "LCTUs8F73iDg"
+      },
+      "execution_count": 30,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "### Generate text embeddings\n",
+        "This example uses the model `sentence-transformers/sentence-t5-large` 
to generate text embeddings. The model uses only the encoder from a `T5-large 
model`. The weights are stored in FP16. For more information about the model, 
see [Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text 
Models](https://arxiv.org/abs/2108.08877)."
+      ],
+      "metadata": {
+        "id": "SApMmlRLRv_e"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_t5 = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'])\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "      pipeline\n",
+        "      | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_t5).with_transform(embedding_transform))\n",
+        "\n",
+        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> 
beam.Map(print)\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/";
+        },
+        "id": "SF6izkN134sf",
+        "outputId": "524d3506-d31f-4dee-9079-1ed6d7cadf1a"
+      },
+      "execution_count": 31,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'x': [-0.0317193828523159, -0.005265399813652039, 
-0.012499183416366577, 0.00018130357784684747, -0.005592408124357462, 
0.06207558885216713, -0.01656288281083107, 0.0167048592120409, 
-0.01239298190921545, 0.03041897714138031]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.015295305289328098, 0.005405726842582226, 
-0.015631258487701416, 0.022797023877501488, -0.027843449264764786, 
0.03968179598450661, -0.004387892782688141, 0.022909151390194893, 
0.01015392318367958, 0.04723235219717026]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.03450256213545799, -0.002632762538269162, 
-0.022460950538516045, -0.011689935810863972, -0.027329981327056885, 
0.07293087989091873, -0.03069353476166725, 0.05429817736148834, 
-0.01308195199817419, 0.017668722197413445]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02869587577879429, -0.0002648509689606726, 
-0.007186499424278736, -0.0003750955802388489, 0.012458174489438534, 
0.06721009314060211, -0.013404129073023796, 0.03204648941755295, 
-0.021021844819188118, 0.04968355968594551]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.03241290897130966, 0.006845517549663782, 
0.02001815102994442, -0.0057969288900494576, 0.008191823959350586, 
0.08160955458879471, -0.009215254336595535, 0.023534387350082397, 
-0.02034241147339344, 0.0357462577521801]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.04592451825737953, -0.0025395643897354603, 
-0.01178023498505354, 0.011568977497518063, -0.0029014083556830883, 
0.06971456110477448, -0.021167151629924774, 0.015902182087302208, 
-0.015007994137704372, 0.026213033124804497]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [0.005221465136855841, -0.002127869985997677, 
-0.002369001042097807, -0.019337018951773643, 0.023243796080350876, 
0.05599674955010414, -0.022721167653799057, 0.024813007563352585, 
-0.010685156099498272, 0.03624529018998146]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.035339221358299255, 0.010706206783652306, 
-0.001701260800473392, -0.00862252525985241, 0.006445988081395626, 
0.08198338001966476, -0.022678885608911514, 0.01434261817485094, 
-0.008092232048511505, 0.03345781937241554]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.030748076736927032, 0.009340512566268444, 
-0.013637945055961609, 0.011183148249983788, -0.013879665173590183, 
0.046350326389074326, -0.024090109393000603, 0.02885228954255581, 
-0.01699884608387947, 0.01672385260462761]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.040792081505060196, -0.00872269831597805, 
-0.015838179737329483, -0.03141209855675697, -7.104632823029533e-05, 
0.08301416039466858, -0.034691162407398224, 0.0026397297624498606, 
0.009255227632820606, 0.05415954813361168]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02156883291900158, 0.003969342447817326, 
-0.030446071177721024, 0.008231461979448795, -0.01271845493465662, 
0.03793857619166374, -0.013524272479116917, -0.0385628417134285, 
-0.0058258213102817535, 0.03505263477563858]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.027544165030121803, -0.01773364469408989, 
-0.013286487199366093, -0.008328652940690517, -0.011047529056668282, 
0.05237515643239021, -0.016948163509368896, 0.02806701697409153, 
-0.0018120920285582542, 0.027241172268986702]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.03464886546134949, -0.003521248232573271, 
-0.010239562019705772, -0.018618224188685417, 0.004094886127859354, 
0.062059685587882996, -0.013881963677704334, -0.0008639032603241503, 
-0.029874088242650032, 0.033531222492456436]}\n",
+            "Embedding shape: 10\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "You can pass additional arguments that are supported by 
`sentence-transformer` models, such as `convert_to_numpy=False`. These 
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` 
transform by using the `inference_args` parameter.\n",
+        "\n",
+        "When you pass `convert_to_numpy=False`, the output contains 
`torch.Tensor` matrices."
+      ],
+      "metadata": {
+        "id": "1MFom0PW_vRv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_t5_with_inference_args = 
tempfile.mkdtemp(prefix='huggingface_')\n",
+        "\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'],\n",
+        "        inference_args={'convert_to_numpy': False}\n",
+        "        )\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "      pipeline\n",
+        "      | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_t5_with_inference_args).with_transform(embedding_transform))\n",
+        "\n",
+        "  # The outputs are in the Pytorch tensor type.\n",

Review Comment:
   ```suggestion
           "  # The outputs are in the PyTorch tensor type.\n",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Add notebooks for text embeddings [beam]

Reply via email to