Re: [PR] Add notebooks for text embeddings [beam]

via GitHub Tue, 02 Jan 2024 12:28:40 -0800


damccorm commented on code in PR #29893:
URL: https://github.com/apache/beam/pull/29893#discussion_r1439774797



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",

Review Comment:
   (optional): This has some redundancy with the previous sentence (outside of 
this header). Maybe just remove this line and say `Some NLP tasks that use text 
embeddings include:`



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."

Review Comment:
   This probably belongs in the next section. Maybe something like:
   
   ```
   Install Apache Beam and the dependencies needed to work with Hugging Face 
embeddings. This includes the `sentence-transformers` package, which is required 
to use the `SentenceTransformerEmbeddings` module.
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To generate text embeddings with `MLTransform`, the following 
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text 
inputs from the Hugging Face blog [Getting Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."

Review Comment:
   I think the column piece might be less natural for people here, so maybe we 
could add a sentence that explains that MLTransform operates on the columns 
specified in the `SentenceTransformerEmbeddings` transform.
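
To make the column point concrete, here is a minimal plain-Python sketch of the idea. It is not Beam's implementation: `embed_columns` and `toy_embed` are hypothetical names, the sample row is made up for illustration, and the two-number vector only stands in for a real model's output. It assumes each element is a dictionary and that only the named columns are replaced by embeddings, while other fields pass through unchanged.

```python
from typing import Callable, Dict, List, Sequence


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a model: returns a tiny fixed-size vector."""
    words = text.split()
    return [float(len(text)), float(len(words))]


def embed_columns(
    row: Dict[str, object],
    columns: Sequence[str],
    embed: Callable[[str], List[float]] = toy_embed,
) -> Dict[str, object]:
    """Replace only the named columns with embeddings; pass the rest through."""
    out = dict(row)  # copy so the input row is not mutated
    for name in columns:
        out[name] = embed(str(row[name]))
    return out


# Illustrative row: "x" is the text column, "id" is left untouched.
row = {"x": "How do I get a replacement card?", "id": 7}
result = embed_columns(row, columns=["x"])
print(result)
```

A sentence in the notebook along these lines would tell readers why the input elements are dictionaries keyed by a column name rather than bare strings.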



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",

Review Comment:
   ```suggestion
           "* Text classification: Categorizing text data into different 
classes, such as spam or not spam, or positive sentiment or negative 
sentiment.\n",
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",

Review Comment:
   ```suggestion
           "In this notebook, we will use Apache Beam's `MLTransform` to 
generate embeddings from text data.\n",
   ```



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",

Review Comment:
   This section might benefit from a brief introduction of MLTransform (e.g. 
"MLTransform is a transform that can be used for a variety of machine learning 
pre- and post-processing operations and data preparation, including generating 
embeddings.")
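
As a rough illustration of the write/read split described in the quoted cell, the following plain-Python sketch persists a transform's attributes to an artifact directory in a write step and rebuilds the transform from them in a read step. It is not Beam code: `ToyLowercase`, `write_mode`, `read_mode`, and the `attributes.json` file name are all hypothetical, chosen only to show the pattern an introduction could describe.

```python
import json
import tempfile
from pathlib import Path


class ToyLowercase:
    """Hypothetical transform whose only attribute is the columns it touches."""

    def __init__(self, columns):
        self.columns = list(columns)

    def apply(self, row):
        # Lowercase the configured columns; leave every other field as-is.
        return {k: (v.lower() if k in self.columns else v) for k, v in row.items()}


def write_mode(transform, artifact_dir):
    """Save the transform's attributes so a later run can rebuild it."""
    path = Path(artifact_dir) / "attributes.json"
    path.write_text(json.dumps({"columns": transform.columns}))


def read_mode(artifact_dir):
    """Rebuild the transform from saved attributes, with no config passed in."""
    path = Path(artifact_dir) / "attributes.json"
    attrs = json.loads(path.read_text())
    return ToyLowercase(attrs["columns"])


# Write step: configure the transform once and persist its attributes.
artifact_dir = tempfile.mkdtemp()
write_mode(ToyLowercase(columns=["x"]), artifact_dir)

# Read step: only the artifact location is needed to get the same behavior.
restored = read_mode(artifact_dir)
print(restored.apply({"x": "Hello Beam", "id": 1}))
```

The point of the pattern is that the read step needs only the artifact location, so the same configuration is applied consistently across runs.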



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
 />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" 
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
 
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
 />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."

Review Comment:
   Nit: remove leading space



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",

Review Comment:
   Maybe after the `read` mode sentence we could add a short note about why 
   read/write mode is useful (maybe just copy the first bullet in 
   https://beam.apache.org/documentation/ml/preprocess-data/#use-mltransform)



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To generate text embeddings with `MLTransform`, the following 
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text 
inputs from the Hugging Face blog [Getting Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+      ],
+      "metadata": {
+        "id": "Dbkmu3HP6Kql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "content = [\n",
+        "    {'x': 'How do I get a replacement Medicare card?'},\n",
+        "    {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+        "    {'x': 'How do I terminate my Medicare Part B (medical 
insurance)?'},\n",
+        "    {'x': 'How do I sign up for Medicare?'},\n",
+        "    {'x': 'Can I sign up for Medicare Part B if I am working and have 
health insurance through an employer?'},\n",
+        "    {'x': 'How do I sign up for Medicare Part B if I already have 
Part A?'},\n",
+        "    {'x': 'What are Medicare late enrollment penalties?'},\n",
+        "    {'x': 'What is Medicare and who can get it?'},\n",
+        "    {'x': 'How can I get help with my Medicare Part A and Part B 
premiums?'},\n",
+        "    {'x': 'What are the different parts of Medicare?'},\n",
+        "    {'x': 'Will my Medicare premiums be higher because of my higher 
income?'},\n",
+        "    {'x': 'What is TRICARE ?'},\n",
+        "    {'x': \"Should I sign up for Medicare Part B if I have Veterans' 
Benefits?\"}\n",
+        "]\n",
+        "\n",
+        "\n",
+        "# helper function that returns a dict containing only first\n",
+        "#10 elements of generated embeddings.\n",
+        "def truncate_embeddings(d):\n",
+        "  for key in d.keys():\n",
+        "    d[key] = d[key][:10]\n",
+        "  return d"
+      ],
+      "metadata": {
+        "id": "LCTUs8F73iDg"
+      },
+      "execution_count": 25,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "text_embedding_model_name = 
'sentence-transformers/all-MiniLM-L6-v2'\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'])\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+        "\n",
+        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> 
beam.Map(print)\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "SF6izkN134sf",
+        "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'x': [-0.023889463394880295, 0.05525851249694824, 
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726, 
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368, 
0.01850851997733116, -0.08350814878940582]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.01268761046230793, 0.04687413573265076, 
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727, 
0.04232167452573776, 0.016627851873636246, -0.004099288955330849, 
-0.0026070312596857548, -0.010187783278524876]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [0.0004943296662531793, 0.11941202729940414, 
0.005229473114013672, -0.09273427724838257, 0.007772865705192089, 
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786, 
-0.006264965515583754, -0.006110507529228926]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.029711326584219933, 0.02329839952290058, 
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828, 
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754, 
-0.04512352868914604, -0.040747467428445816]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02562842145562172, 0.070388562977314, 
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075, 
0.052822552621364594, 0.06706249713897705, -0.05261750519275665, 
-0.054702047258615494, -0.11623040586709976]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.022656124085187912, 0.021159743890166283, 
0.0051048519089818, -0.04649421200156212, 0.009073587134480476, 
0.04149482399225235, 0.0542682446539402, -0.02418488636612892, 
-0.013482789508998394, -0.07596635073423386]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0029113641940057278, 0.060791268944740295, 
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759, 
0.036593958735466, 0.002054463606327772, -0.03134453296661377, 
0.03180575743317604, -0.02349487692117691]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.08052562177181244, 0.05988812819123268, 
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941, 
0.04184781387448311, 0.11904510855674744, 0.010651882737874985, 
-0.030094878748059273, -0.004561211448162794]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0343877375125885, 0.07250142097473145, 
0.01443990133702755, -0.03669498860836029, 0.014018685556948185, 
0.06307007372379303, 0.03468254581093788, -0.014530746266245842, 
-0.05986189469695091, -0.04538322612643242]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.005963834468275309, 0.025043703615665436, 
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116, 
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823, 
-0.03821341320872307, -0.04114910215139389]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.039007965475320816, -0.010609461925923824, 
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563, 
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938, 
-0.014126974157989025, -0.061636749655008316]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.09598278254270554, -0.06301165372133255, 
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278, 
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741, 
-0.04905705526471138, -0.031649429351091385]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.011600406840443611, 0.05651004612445831, 
0.016623979434370995, -0.09469003975391388, -0.009865491650998592, 
0.07234735041856766, 0.04412448778748512, -0.0411749929189682, 
-0.04212445020675659, -0.10263106226921082]}\n",
+            "Embedding shape: 10\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Pass additional arguments that are supported by 
`sentence-transformer` models, such as `convert_to_numpy=False`. These 
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` 
transform by using the `inference_args` parameter.\n",
+        "\n",
+        "By passing `convert_to_numpy=False`, the output will contain 
`torch.Tensor`s."
+      ],
+      "metadata": {
+        "id": "1MFom0PW_vRv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm_with_inference_args = 
tempfile.mkdtemp(prefix='huggingface_')\n",
+        "\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'],\n",
+        "        inference_args={'convert_to_numpy': False}\n",
+        "        )\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",
+        "\n",
+        "  # The outputs are in the Pytorch tensor type.\n",
+        "  transformed_pcoll | 'LogOutput' >> beam.Map(lambda x: 
print(type(x['x'])))\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))\n"
+      ],
+      "metadata": {
+        "id": "xyezKuzY_uLD",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "d09a07d5-55dc-4544-ea75-39b8105a3e5b"
+      },
+      "execution_count": 27,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Use the model `sentence-transformers/sentence-t5-large` to generate 
text embeddings. The model uses only the encoder from a `T5-large model`. The 
weights are stored in FP16. For more information about the model, see 
[Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text 
Models](https://arxiv.org/abs/2108.08877)."

Review Comment:
   ```suggestion
           "Next, we will use the model 
`sentence-transformers/sentence-t5-large` to generate text embeddings. The 
model uses only the encoder from a `T5-large model`. The weights are stored in 
FP16. For more information about the model, see [Sentence-T5: Scalable Sentence 
Encoders from Pre-trained Text-to-Text 
Models](https://arxiv.org/abs/2108.08877)."
   ```
   
   Is this example showing something meaningfully different from the one above 
it? If not, I'd cut it. If yes, then I'd emphasize the difference in this 
paragraph.
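   
   A side note on the first pipeline cell: its output prints `Embedding shape: 10`, 
   while the `inference_args` cell prints 384. That suggests `truncate_embeddings` 
   mutates the shared element in place, so the sibling `PrintEmbeddingShape` branch 
   sees the already truncated list. Below is a minimal sketch of a non-mutating 
   helper, assuming the elements are plain dicts of sliceable values as in the 
   notebook. The `n` parameter is a hypothetical addition and is not in the PR.

```python
# Sketch only: a non-mutating variant of the notebook's helper.
# The `n` parameter is hypothetical; the PR hardcodes 10.
def truncate_embeddings(d, n=10):
    """Return a new dict holding only the first n values of each embedding."""
    # Build a fresh dict so the input element is left untouched.
    return {key: value[:n] for key, value in d.items()}


# Stand-in element with the same length as an all-MiniLM-L6-v2 embedding.
element = {'x': [0.1] * 384}
preview = truncate_embeddings(element)

# The preview is truncated; the original element keeps its full length.
print(len(preview['x']), len(element['x']))
```

   With a helper along these lines, the shape branch would presumably report the 
   full embedding length in both cells.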



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorzing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `the 
sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To generate text embeddings with `MLTransform`, the following 
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text 
inputs from the Hugging Face blog [Getting Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+      ],
+      "metadata": {
+        "id": "Dbkmu3HP6Kql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "content = [\n",
+        "    {'x': 'How do I get a replacement Medicare card?'},\n",
+        "    {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+        "    {'x': 'How do I terminate my Medicare Part B (medical 
insurance)?'},\n",
+        "    {'x': 'How do I sign up for Medicare?'},\n",
+        "    {'x': 'Can I sign up for Medicare Part B if I am working and have 
health insurance through an employer?'},\n",
+        "    {'x': 'How do I sign up for Medicare Part B if I already have 
Part A?'},\n",
+        "    {'x': 'What are Medicare late enrollment penalties?'},\n",
+        "    {'x': 'What is Medicare and who can get it?'},\n",
+        "    {'x': 'How can I get help with my Medicare Part A and Part B 
premiums?'},\n",
+        "    {'x': 'What are the different parts of Medicare?'},\n",
+        "    {'x': 'Will my Medicare premiums be higher because of my higher 
income?'},\n",
+        "    {'x': 'What is TRICARE ?'},\n",
+        "    {'x': \"Should I sign up for Medicare Part B if I have Veterans' 
Benefits?\"}\n",
+        "]\n",
+        "\n",
+        "\n",
+        "# helper function that returns a dict containing only first\n",
+        "#10 elements of generated embeddings.\n",
+        "def truncate_embeddings(d):\n",
+        "  for key in d.keys():\n",
+        "    d[key] = d[key][:10]\n",
+        "  return d"
+      ],
+      "metadata": {
+        "id": "LCTUs8F73iDg"
+      },
+      "execution_count": 25,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "text_embedding_model_name = 
'sentence-transformers/all-MiniLM-L6-v2'\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'])\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+        "\n",
+        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> 
beam.Map(print)\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "SF6izkN134sf",
+        "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'x': [-0.023889463394880295, 0.05525851249694824, 
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726, 
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368, 
0.01850851997733116, -0.08350814878940582]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.01268761046230793, 0.04687413573265076, 
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727, 
0.04232167452573776, 0.016627851873636246, -0.004099288955330849, 
-0.0026070312596857548, -0.010187783278524876]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [0.0004943296662531793, 0.11941202729940414, 
0.005229473114013672, -0.09273427724838257, 0.007772865705192089, 
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786, 
-0.006264965515583754, -0.006110507529228926]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.029711326584219933, 0.02329839952290058, 
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828, 
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754, 
-0.04512352868914604, -0.040747467428445816]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02562842145562172, 0.070388562977314, 
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075, 
0.052822552621364594, 0.06706249713897705, -0.05261750519275665, 
-0.054702047258615494, -0.11623040586709976]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.022656124085187912, 0.021159743890166283, 
0.0051048519089818, -0.04649421200156212, 0.009073587134480476, 
0.04149482399225235, 0.0542682446539402, -0.02418488636612892, 
-0.013482789508998394, -0.07596635073423386]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0029113641940057278, 0.060791268944740295, 
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759, 
0.036593958735466, 0.002054463606327772, -0.03134453296661377, 
0.03180575743317604, -0.02349487692117691]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.08052562177181244, 0.05988812819123268, 
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941, 
0.04184781387448311, 0.11904510855674744, 0.010651882737874985, 
-0.030094878748059273, -0.004561211448162794]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0343877375125885, 0.07250142097473145, 
0.01443990133702755, -0.03669498860836029, 0.014018685556948185, 
0.06307007372379303, 0.03468254581093788, -0.014530746266245842, 
-0.05986189469695091, -0.04538322612643242]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.005963834468275309, 0.025043703615665436, 
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116, 
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823, 
-0.03821341320872307, -0.04114910215139389]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.039007965475320816, -0.010609461925923824, 
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563, 
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938, 
-0.014126974157989025, -0.061636749655008316]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.09598278254270554, -0.06301165372133255, 
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278, 
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741, 
-0.04905705526471138, -0.031649429351091385]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.011600406840443611, 0.05651004612445831, 
0.016623979434370995, -0.09469003975391388, -0.009865491650998592, 
0.07234735041856766, 0.04412448778748512, -0.0411749929189682, 
-0.04212445020675659, -0.10263106226921082]}\n",
+            "Embedding shape: 10\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Pass additional arguments that are supported by 
`sentence-transformer` models, such as `convert_to_numpy=False`. These 
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` 
transform by using the `inference_args` parameter.\n",
+        "\n",
+        "By passing `convert_to_numpy=False`, the output will contain 
`torch.Tensor`s."
+      ],
+      "metadata": {
+        "id": "1MFom0PW_vRv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm_with_inference_args = 
tempfile.mkdtemp(prefix='huggingface_')\n",
+        "\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'],\n",
+        "        inference_args={'convert_to_numpy': False}\n",
+        "        )\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",

Review Comment:
   Same comment applies elsewhere



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorizing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
generate embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the 
`sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To generate text embeddings with `MLTransform`, the following 
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text 
inputs from the Hugging Face blog [Getting Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+      ],
+      "metadata": {
+        "id": "Dbkmu3HP6Kql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "content = [\n",
+        "    {'x': 'How do I get a replacement Medicare card?'},\n",
+        "    {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+        "    {'x': 'How do I terminate my Medicare Part B (medical 
insurance)?'},\n",
+        "    {'x': 'How do I sign up for Medicare?'},\n",
+        "    {'x': 'Can I sign up for Medicare Part B if I am working and have 
health insurance through an employer?'},\n",
+        "    {'x': 'How do I sign up for Medicare Part B if I already have 
Part A?'},\n",
+        "    {'x': 'What are Medicare late enrollment penalties?'},\n",
+        "    {'x': 'What is Medicare and who can get it?'},\n",
+        "    {'x': 'How can I get help with my Medicare Part A and Part B 
premiums?'},\n",
+        "    {'x': 'What are the different parts of Medicare?'},\n",
+        "    {'x': 'Will my Medicare premiums be higher because of my higher 
income?'},\n",
+        "    {'x': 'What is TRICARE ?'},\n",
+        "    {'x': \"Should I sign up for Medicare Part B if I have Veterans' 
Benefits?\"}\n",
+        "]\n",
+        "\n",
+        "\n",
+        "# helper function that returns a dict containing only first\n",
+        "#10 elements of generated embeddings.\n",
+        "def truncate_embeddings(d):\n",
+        "  for key in d.keys():\n",
+        "    d[key] = d[key][:10]\n",
+        "  return d"
+      ],
+      "metadata": {
+        "id": "LCTUs8F73iDg"
+      },
+      "execution_count": 25,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "text_embedding_model_name = 
'sentence-transformers/all-MiniLM-L6-v2'\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'])\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+        "\n",
+        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> 
beam.Map(print)\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "SF6izkN134sf",
+        "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'x': [-0.023889463394880295, 0.05525851249694824, 
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726, 
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368, 
0.01850851997733116, -0.08350814878940582]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.01268761046230793, 0.04687413573265076, 
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727, 
0.04232167452573776, 0.016627851873636246, -0.004099288955330849, 
-0.0026070312596857548, -0.010187783278524876]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [0.0004943296662531793, 0.11941202729940414, 
0.005229473114013672, -0.09273427724838257, 0.007772865705192089, 
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786, 
-0.006264965515583754, -0.006110507529228926]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.029711326584219933, 0.02329839952290058, 
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828, 
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754, 
-0.04512352868914604, -0.040747467428445816]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02562842145562172, 0.070388562977314, 
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075, 
0.052822552621364594, 0.06706249713897705, -0.05261750519275665, 
-0.054702047258615494, -0.11623040586709976]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.022656124085187912, 0.021159743890166283, 
0.0051048519089818, -0.04649421200156212, 0.009073587134480476, 
0.04149482399225235, 0.0542682446539402, -0.02418488636612892, 
-0.013482789508998394, -0.07596635073423386]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0029113641940057278, 0.060791268944740295, 
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759, 
0.036593958735466, 0.002054463606327772, -0.03134453296661377, 
0.03180575743317604, -0.02349487692117691]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.08052562177181244, 0.05988812819123268, 
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941, 
0.04184781387448311, 0.11904510855674744, 0.010651882737874985, 
-0.030094878748059273, -0.004561211448162794]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0343877375125885, 0.07250142097473145, 
0.01443990133702755, -0.03669498860836029, 0.014018685556948185, 
0.06307007372379303, 0.03468254581093788, -0.014530746266245842, 
-0.05986189469695091, -0.04538322612643242]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.005963834468275309, 0.025043703615665436, 
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116, 
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823, 
-0.03821341320872307, -0.04114910215139389]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.039007965475320816, -0.010609461925923824, 
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563, 
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938, 
-0.014126974157989025, -0.061636749655008316]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.09598278254270554, -0.06301165372133255, 
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278, 
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741, 
-0.04905705526471138, -0.031649429351091385]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.011600406840443611, 0.05651004612445831, 
0.016623979434370995, -0.09469003975391388, -0.009865491650998592, 
0.07234735041856766, 0.04412448778748512, -0.0411749929189682, 
-0.04212445020675659, -0.10263106226921082]}\n",
+            "Embedding shape: 10\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Pass additional arguments that are supported by 
`sentence-transformer` models, such as `convert_to_numpy=False`. These 
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` 
transform by using the `inference_args` parameter.\n",

Review Comment:
   ```suggestion
           "You can also pass additional arguments that are supported by 
`sentence-transformer` models, such as `convert_to_numpy=False`. These 
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` 
transform by using the `inference_args` parameter.\n",
   ```
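
   For readers following along, here is a minimal sketch of what "passed as a
   `dict` by using the `inference_args` parameter" means in practice: the dict
   is unpacked into keyword arguments on the model's encode call. The class
   `StandInSentenceModel` and the function `run_inference` below are
   hypothetical stand-ins written for illustration; they are not Beam's or
   sentence-transformers' actual code, and the real transform may handle the
   arguments differently.

```python
class StandInSentenceModel:
    """Hypothetical stand-in for a sentence-transformers model (not the real class)."""

    def encode(self, sentences, convert_to_numpy=True, **kwargs):
        # The real library returns arrays or tensors; this stand-in only
        # records which output type was requested so the forwarding is visible.
        kind = "numpy" if convert_to_numpy else "tensor"
        return [(kind, sentence) for sentence in sentences]


def run_inference(model, batch, inference_args=None):
    # Unpack the user-supplied dict into keyword arguments for encode().
    return model.encode(batch, **(inference_args or {}))


default_out = run_inference(StandInSentenceModel(), ["What is TRICARE ?"])
tensor_out = run_inference(
    StandInSentenceModel(),
    ["What is TRICARE ?"],
    inference_args={"convert_to_numpy": False},
)
print(default_out)
print(tensor_out)
```

   With no `inference_args`, the stand-in falls back to its default output
   type; passing `{'convert_to_numpy': False}` switches it, which mirrors the
   behavior the notebook text describes.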



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorizing text data into different classes, 
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to 
generate embeddings on the text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the 
`sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To generate text embeddings with `MLTransform`, the following 
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text 
inputs from the Hugging Face blog [Getting Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+      ],
+      "metadata": {
+        "id": "Dbkmu3HP6Kql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "content = [\n",
+        "    {'x': 'How do I get a replacement Medicare card?'},\n",
+        "    {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+        "    {'x': 'How do I terminate my Medicare Part B (medical 
insurance)?'},\n",
+        "    {'x': 'How do I sign up for Medicare?'},\n",
+        "    {'x': 'Can I sign up for Medicare Part B if I am working and have 
health insurance through an employer?'},\n",
+        "    {'x': 'How do I sign up for Medicare Part B if I already have 
Part A?'},\n",
+        "    {'x': 'What are Medicare late enrollment penalties?'},\n",
+        "    {'x': 'What is Medicare and who can get it?'},\n",
+        "    {'x': 'How can I get help with my Medicare Part A and Part B 
premiums?'},\n",
+        "    {'x': 'What are the different parts of Medicare?'},\n",
+        "    {'x': 'Will my Medicare premiums be higher because of my higher 
income?'},\n",
+        "    {'x': 'What is TRICARE ?'},\n",
+        "    {'x': \"Should I sign up for Medicare Part B if I have Veterans' 
Benefits?\"}\n",
+        "]\n",
+        "\n",
+        "\n",
+        "# helper function that returns a dict containing only first\n",
+        "#10 elements of generated embeddings.\n",
+        "def truncate_embeddings(d):\n",
+        "  for key in d.keys():\n",
+        "    d[key] = d[key][:10]\n",
+        "  return d"
+      ],
+      "metadata": {
+        "id": "LCTUs8F73iDg"
+      },
+      "execution_count": 25,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "text_embedding_model_name = 
'sentence-transformers/all-MiniLM-L6-v2'\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'])\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+        "\n",
+        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> 
beam.Map(print)\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "SF6izkN134sf",
+        "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'x': [-0.023889463394880295, 0.05525851249694824, 
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726, 
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368, 
0.01850851997733116, -0.08350814878940582]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.01268761046230793, 0.04687413573265076, 
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727, 
0.04232167452573776, 0.016627851873636246, -0.004099288955330849, 
-0.0026070312596857548, -0.010187783278524876]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [0.0004943296662531793, 0.11941202729940414, 
0.005229473114013672, -0.09273427724838257, 0.007772865705192089, 
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786, 
-0.006264965515583754, -0.006110507529228926]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.029711326584219933, 0.02329839952290058, 
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828, 
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754, 
-0.04512352868914604, -0.040747467428445816]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02562842145562172, 0.070388562977314, 
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075, 
0.052822552621364594, 0.06706249713897705, -0.05261750519275665, 
-0.054702047258615494, -0.11623040586709976]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.022656124085187912, 0.021159743890166283, 
0.0051048519089818, -0.04649421200156212, 0.009073587134480476, 
0.04149482399225235, 0.0542682446539402, -0.02418488636612892, 
-0.013482789508998394, -0.07596635073423386]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0029113641940057278, 0.060791268944740295, 
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759, 
0.036593958735466, 0.002054463606327772, -0.03134453296661377, 
0.03180575743317604, -0.02349487692117691]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.08052562177181244, 0.05988812819123268, 
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941, 
0.04184781387448311, 0.11904510855674744, 0.010651882737874985, 
-0.030094878748059273, -0.004561211448162794]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0343877375125885, 0.07250142097473145, 
0.01443990133702755, -0.03669498860836029, 0.014018685556948185, 
0.06307007372379303, 0.03468254581093788, -0.014530746266245842, 
-0.05986189469695091, -0.04538322612643242]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.005963834468275309, 0.025043703615665436, 
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116, 
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823, 
-0.03821341320872307, -0.04114910215139389]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.039007965475320816, -0.010609461925923824, 
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563, 
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938, 
-0.014126974157989025, -0.061636749655008316]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.09598278254270554, -0.06301165372133255, 
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278, 
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741, 
-0.04905705526471138, -0.031649429351091385]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.011600406840443611, 0.05651004612445831, 
0.016623979434370995, -0.09469003975391388, -0.009865491650998592, 
0.07234735041856766, 0.04412448778748512, -0.0411749929189682, 
-0.04212445020675659, -0.10263106226921082]}\n",
+            "Embedding shape: 10\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Pass additional arguments that are supported by `sentence-transformer` models, such as `convert_to_numpy=False`. These arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` transform by using the `inference_args` parameter.\n",
+        "\n",
+        "By passing `convert_to_numpy=False`, the output will contain `torch.Tensor`s."
+      ],
+      "metadata": {
+        "id": "1MFom0PW_vRv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm_with_inference_args = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'],\n",
+        "        inference_args={'convert_to_numpy': False}\n",
+        "        )\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",

Review Comment:
   Nit: the formatting is a little funky here; the indentation at this level should 
match the expression right above it.



