damccorm commented on code in PR #29893:
URL: https://github.com/apache/beam/pull/29893#discussion_r1439774797
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
Review Comment:
(optional): This has some redundancy with the previous sentence (outside of
this header). Maybe just remove this line and say `Some NLP tasks that use text
embeddings include:`
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
embeddings on the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the `the
sentence-transformers` package."
Review Comment:
This probably belongs in the next section. Maybe something like:
```
Install Apache Beam and the dependencies needed to work with Hugging Face
embeddings. This includes the `sentence-transformers` package, which is required
to use the `SentenceTransformerEmbeddings` module.
```
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
embeddings on the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the `the
sentence-transformers` package."
+ ],
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Install dependencies\n",
+ " Install Apache Beam and the dependencies needed to work with Hugging
Face embeddings."
+ ],
+ "metadata": {
+ "id": "jqYXaBJ821Zs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "! git clone https://github.com/apache/beam.git\n",
+ "! cd beam/sdks/python\n",
+ "! pip install beam/sdks/python\n",
+ "! pip install sentence-transformers"
+ ],
+ "metadata": {
+ "id": "shzCUrZI1XhF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import tempfile\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.embeddings.huggingface import
SentenceTransformerEmbeddings"
+ ],
+ "metadata": {
+ "id": "jVxSi2jS3M3b"
+ },
+ "execution_count": 24,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their
attributes to an artifact location. These transforms are used when you run
`MLTransform` in `read` mode.\n",
+ "\n",
+ "For more information about using `MLTransform`, see [Preprocess data
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in
the Apache Beam documentation."
+ ],
+ "metadata": {
+ "id": "kXDM8C7d3nPV"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To generate text embeddings with `MLTransform`, the following
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text
inputs from the Hugging Face blog [Getting Started With
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
Review Comment:
I think the column piece might be less natural for people here, so maybe we
could add a sentence explaining that `MLTransform` operates on the columns
specified in the `columns` argument of `SentenceTransformerEmbeddings`.
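If it helps when wording that sentence, the column idea could be pictured with a plain-Python sketch like the one below. This is not Beam's implementation and does not call any Beam API; `embed_columns` and `toy_embedding` are hypothetical names, and the toy function only stands in for a real embedding model. The point is just that each input row is a dictionary, the named columns are replaced by vectors, and the other keys pass through unchanged.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def embed_columns(
    rows: Sequence[Row],
    columns: Sequence[str],
    embed_fn: Callable[[str], List[float]],
) -> List[Row]:
    """Replace the named columns of each row with embed_fn(value).

    Hypothetical helper for illustration only; keys that are not listed
    in `columns` are copied through untouched.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for col in columns:
            new_row[col] = embed_fn(str(row[col]))
        result.append(new_row)
    return result


def toy_embedding(text: str) -> List[float]:
    # Stand-in for a real model: [character count, word count].
    return [float(len(text)), float(len(text.split()))]


rows = [{"x": "hello world", "id": 1}, {"x": "beam", "id": 2}]
embedded = embed_columns(rows, columns=["x"], embed_fn=toy_embedding)
print(embedded)
```

Only the `x` column is turned into a vector here; `id` is left as it was, which is the behavior the added sentence would need to get across.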
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
Review Comment:
```suggestion
"* Text classification: Categorizing text data into different
classes, such as spam or not spam, or positive sentiment or negative
sentiment.\n",
```
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
embeddings on the text data.\n",
Review Comment:
```suggestion
"In this notebook, we will use Apache Beam's `MLTransform` to
generate embeddings from text data.\n",
```
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
embeddings on the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the `the
sentence-transformers` package."
+ ],
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Install dependencies\n",
+ " Install Apache Beam and the dependencies needed to work with Hugging
Face embeddings."
+ ],
+ "metadata": {
+ "id": "jqYXaBJ821Zs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "! git clone https://github.com/apache/beam.git\n",
+ "! cd beam/sdks/python\n",
+ "! pip install beam/sdks/python\n",
+ "! pip install sentence-transformers"
+ ],
+ "metadata": {
+ "id": "shzCUrZI1XhF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import tempfile\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.embeddings.huggingface import
SentenceTransformerEmbeddings"
+ ],
+ "metadata": {
+ "id": "jVxSi2jS3M3b"
+ },
+ "execution_count": 24,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their
attributes to an artifact location. These transforms are used when you run
`MLTransform` in `read` mode.\n",
Review Comment:
This section might benefit from a brief introduction of MLTransform (e.g.
"MLTransform is a transform that can be used for a variety of machine learning
pre- and post-processing operations and data preparation, including generating
embeddings.")
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
embeddings on the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the `the
sentence-transformers` package."
+ ],
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Install dependencies\n",
+ " Install Apache Beam and the dependencies needed to work with Hugging
Face embeddings."
Review Comment:
Nit: remove leading space
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorizing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
generate embeddings for the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the
`sentence-transformers` package."
+ ],
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Install dependencies\n",
+ " Install Apache Beam and the dependencies needed to work with Hugging
Face embeddings."
+ ],
+ "metadata": {
+ "id": "jqYXaBJ821Zs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "! git clone https://github.com/apache/beam.git\n",
+ "! cd beam/sdks/python\n",
+ "! pip install beam/sdks/python\n",
+ "! pip install sentence-transformers"
+ ],
+ "metadata": {
+ "id": "shzCUrZI1XhF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import tempfile\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.embeddings.huggingface import
SentenceTransformerEmbeddings"
+ ],
+ "metadata": {
+ "id": "jVxSi2jS3M3b"
+ },
+ "execution_count": 24,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their
attributes to an artifact location. These transforms are used when you run
`MLTransform` in `read` mode.\n",
Review Comment:
Maybe after the `read` mode sentence we could also add a short snippet about
why read/write mode is useful (for example, by copying the first bullet in
https://beam.apache.org/documentation/ml/preprocess-data/#use-mltransform).
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their
attributes to an artifact location. These transforms are used when you run
`MLTransform` in `read` mode.\n",
+ "\n",
+ "For more information about using `MLTransform`, see [Preprocess data
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in
the Apache Beam documentation."
+ ],
+ "metadata": {
+ "id": "kXDM8C7d3nPV"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To generate text embeddings with `MLTransform`, the following
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text
inputs from the Hugging Face blog [Getting Started With
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+ ],
+ "metadata": {
+ "id": "Dbkmu3HP6Kql"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "content = [\n",
+ " {'x': 'How do I get a replacement Medicare card?'},\n",
+ " {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+ " {'x': 'How do I terminate my Medicare Part B (medical
insurance)?'},\n",
+ " {'x': 'How do I sign up for Medicare?'},\n",
+ " {'x': 'Can I sign up for Medicare Part B if I am working and have
health insurance through an employer?'},\n",
+ " {'x': 'How do I sign up for Medicare Part B if I already have
Part A?'},\n",
+ " {'x': 'What are Medicare late enrollment penalties?'},\n",
+ " {'x': 'What is Medicare and who can get it?'},\n",
+ " {'x': 'How can I get help with my Medicare Part A and Part B
premiums?'},\n",
+ " {'x': 'What are the different parts of Medicare?'},\n",
+ " {'x': 'Will my Medicare premiums be higher because of my higher
income?'},\n",
+ " {'x': 'What is TRICARE ?'},\n",
+ " {'x': \"Should I sign up for Medicare Part B if I have Veterans'
Benefits?\"}\n",
+ "]\n",
+ "\n",
+ "\n",
+ "# Helper function that truncates each embedding in the dict to its\n",
+ "# first 10 elements. Note that it modifies the input dict in place.\n",
+ "def truncate_embeddings(d):\n",
+ " for key in d.keys():\n",
+ " d[key] = d[key][:10]\n",
+ " return d"
+ ],
+ "metadata": {
+ "id": "LCTUs8F73iDg"
+ },
+ "execution_count": 25,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+ "text_embedding_model_name =
'sentence-transformers/all-MiniLM-L6-v2'\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'])\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+ "\n",
+ " transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >>
beam.Map(print)\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x:
print(f\"Embedding shape: {len(x['x'])}\"))"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "SF6izkN134sf",
+ "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+ },
+ "execution_count": 26,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "{'x': [-0.023889463394880295, 0.05525851249694824,
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726,
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368,
0.01850851997733116, -0.08350814878940582]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.01268761046230793, 0.04687413573265076,
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727,
0.04232167452573776, 0.016627851873636246, -0.004099288955330849,
-0.0026070312596857548, -0.010187783278524876]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [0.0004943296662531793, 0.11941202729940414,
0.005229473114013672, -0.09273427724838257, 0.007772865705192089,
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786,
-0.006264965515583754, -0.006110507529228926]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.029711326584219933, 0.02329839952290058,
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828,
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754,
-0.04512352868914604, -0.040747467428445816]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.02562842145562172, 0.070388562977314,
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075,
0.052822552621364594, 0.06706249713897705, -0.05261750519275665,
-0.054702047258615494, -0.11623040586709976]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.022656124085187912, 0.021159743890166283,
0.0051048519089818, -0.04649421200156212, 0.009073587134480476,
0.04149482399225235, 0.0542682446539402, -0.02418488636612892,
-0.013482789508998394, -0.07596635073423386]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.0029113641940057278, 0.060791268944740295,
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759,
0.036593958735466, 0.002054463606327772, -0.03134453296661377,
0.03180575743317604, -0.02349487692117691]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.08052562177181244, 0.05988812819123268,
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941,
0.04184781387448311, 0.11904510855674744, 0.010651882737874985,
-0.030094878748059273, -0.004561211448162794]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.0343877375125885, 0.07250142097473145,
0.01443990133702755, -0.03669498860836029, 0.014018685556948185,
0.06307007372379303, 0.03468254581093788, -0.014530746266245842,
-0.05986189469695091, -0.04538322612643242]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.005963834468275309, 0.025043703615665436,
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116,
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823,
-0.03821341320872307, -0.04114910215139389]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.039007965475320816, -0.010609461925923824,
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563,
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938,
-0.014126974157989025, -0.061636749655008316]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.09598278254270554, -0.06301165372133255,
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278,
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741,
-0.04905705526471138, -0.031649429351091385]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.011600406840443611, 0.05651004612445831,
0.016623979434370995, -0.09469003975391388, -0.009865491650998592,
0.07234735041856766, 0.04412448778748512, -0.0411749929189682,
-0.04212445020675659, -0.10263106226921082]}\n",
+ "Embedding shape: 10\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Pass additional arguments that are supported by
`sentence-transformer` models, such as `convert_to_numpy=False`. These
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings`
transform by using the `inference_args` parameter.\n",
+ "\n",
+ "By passing `convert_to_numpy=False`, the output will contain
`torch.Tensor`s."
+ ],
+ "metadata": {
+ "id": "1MFom0PW_vRv"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm_with_inference_args =
tempfile.mkdtemp(prefix='huggingface_')\n",
+ "\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'],\n",
+ " inference_args={'convert_to_numpy': False}\n",
+ " )\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",
+ "\n",
+ " # The outputs are in the PyTorch tensor type.\n",
+ " transformed_pcoll | 'LogOutput' >> beam.Map(lambda x:
print(type(x['x'])))\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x:
print(f\"Embedding shape: {len(x['x'])}\"))\n"
+ ],
+ "metadata": {
+ "id": "xyezKuzY_uLD",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "d09a07d5-55dc-4544-ea75-39b8105a3e5b"
+ },
+ "execution_count": 27,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Use the model `sentence-transformers/sentence-t5-large` to generate
text embeddings. The model uses only the encoder from a `T5-large model`. The
weights are stored in FP16. For more information about the model, see
[Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text
Models](https://arxiv.org/abs/2108.08877)."
Review Comment:
```suggestion
"Next, we will use the model
`sentence-transformers/sentence-t5-large` to generate text embeddings. The
model uses only the encoder from a `T5-large model`. The weights are stored in
FP16. For more information about the model, see [Sentence-T5: Scalable Sentence
Encoders from Pre-trained Text-to-Text
Models](https://arxiv.org/abs/2108.08877)."
```
Is this example showing something meaningfully different from the one above
it? If not, I'd cut it. If yes, then I'd emphasize the difference in this
paragraph.
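One more note on the first pipeline in this hunk: `truncate_embeddings` overwrites each element's dict in place, which is probably why the recorded output shows `Embedding shape: 10` rather than 384 for the `PrintEmbeddingShape` branch. A minimal plain-Python sketch of the difference, with no Beam involved (`truncate_in_place` mirrors the notebook helper, while `truncate_copy` and `element` are hypothetical names used only for illustration, not code from the PR):

```python
def truncate_in_place(d):
    """Mirrors the notebook helper: overwrites each value in the input dict."""
    for key in d.keys():
        d[key] = d[key][:10]
    return d


def truncate_copy(d):
    """Hypothetical alternative: builds a new dict and leaves the input untouched."""
    return {key: value[:10] for key, value in d.items()}


# Stand-in for one pipeline element holding a 384-dimensional embedding.
element = {'x': [0.0] * 384}

preview = truncate_copy(element)
print(len(preview['x']), len(element['x']))  # the input keeps all 384 values

truncate_in_place(element)
print(len(element['x']))  # the input itself is now truncated
```

Returning a new dict keeps the full embedding available to any other branch that reads the same element.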
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm_with_inference_args =
tempfile.mkdtemp(prefix='huggingface_')\n",
+ "\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'],\n",
+ " inference_args={'convert_to_numpy': False}\n",
+ " )\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",
Review Comment:
Same comment applies elsewhere
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorzing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
embeddings on the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the `the
sentence-transformers` package."
+ ],
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Install dependencies\n",
+ " Install Apache Beam and the dependencies needed to work with Hugging
Face embeddings."
+ ],
+ "metadata": {
+ "id": "jqYXaBJ821Zs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "! git clone https://github.com/apache/beam.git\n",
+ "! cd beam/sdks/python\n",
+ "! pip install beam/sdks/python\n",
+ "! pip install sentence-transformers"
+ ],
+ "metadata": {
+ "id": "shzCUrZI1XhF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import tempfile\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.embeddings.huggingface import
SentenceTransformerEmbeddings"
+ ],
+ "metadata": {
+ "id": "jVxSi2jS3M3b"
+ },
+ "execution_count": 24,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their
attributes to an artifact location. These transforms are used when you run
`MLTransform` in `read` mode.\n",
+ "\n",
+ "For more information about using `MLTransform`, see [Preprocess data
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in
the Apache Beam documentation."
+ ],
+ "metadata": {
+ "id": "kXDM8C7d3nPV"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To generate text embeddings with `MLTransform`, the following
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text
inputs from the Hugging Face blog [Getting Started With
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+ ],
+ "metadata": {
+ "id": "Dbkmu3HP6Kql"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "content = [\n",
+ " {'x': 'How do I get a replacement Medicare card?'},\n",
+ " {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+ " {'x': 'How do I terminate my Medicare Part B (medical
insurance)?'},\n",
+ " {'x': 'How do I sign up for Medicare?'},\n",
+ " {'x': 'Can I sign up for Medicare Part B if I am working and have
health insurance through an employer?'},\n",
+ " {'x': 'How do I sign up for Medicare Part B if I already have
Part A?'},\n",
+ " {'x': 'What are Medicare late enrollment penalties?'},\n",
+ " {'x': 'What is Medicare and who can get it?'},\n",
+ " {'x': 'How can I get help with my Medicare Part A and Part B
premiums?'},\n",
+ " {'x': 'What are the different parts of Medicare?'},\n",
+ " {'x': 'Will my Medicare premiums be higher because of my higher
income?'},\n",
+ " {'x': 'What is TRICARE ?'},\n",
+ " {'x': \"Should I sign up for Medicare Part B if I have Veterans'
Benefits?\"}\n",
+ "]\n",
+ "\n",
+ "\n",
+ "# helper function that returns a dict containing only first\n",
+ "#10 elements of generated embeddings.\n",
+ "def truncate_embeddings(d):\n",
+ " for key in d.keys():\n",
+ " d[key] = d[key][:10]\n",
+ " return d"
+ ],
+ "metadata": {
+ "id": "LCTUs8F73iDg"
+ },
+ "execution_count": 25,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+ "text_embedding_model_name =
'sentence-transformers/all-MiniLM-L6-v2'\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'])\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+ "\n",
+ " transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >>
beam.Map(print)\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x:
print(f\"Embedding shape: {len(x['x'])}\"))"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "SF6izkN134sf",
+ "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+ },
+ "execution_count": 26,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "{'x': [-0.023889463394880295, 0.05525851249694824,
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726,
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368,
0.01850851997733116, -0.08350814878940582]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.01268761046230793, 0.04687413573265076,
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727,
0.04232167452573776, 0.016627851873636246, -0.004099288955330849,
-0.0026070312596857548, -0.010187783278524876]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [0.0004943296662531793, 0.11941202729940414,
0.005229473114013672, -0.09273427724838257, 0.007772865705192089,
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786,
-0.006264965515583754, -0.006110507529228926]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.029711326584219933, 0.02329839952290058,
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828,
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754,
-0.04512352868914604, -0.040747467428445816]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.02562842145562172, 0.070388562977314,
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075,
0.052822552621364594, 0.06706249713897705, -0.05261750519275665,
-0.054702047258615494, -0.11623040586709976]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.022656124085187912, 0.021159743890166283,
0.0051048519089818, -0.04649421200156212, 0.009073587134480476,
0.04149482399225235, 0.0542682446539402, -0.02418488636612892,
-0.013482789508998394, -0.07596635073423386]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.0029113641940057278, 0.060791268944740295,
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759,
0.036593958735466, 0.002054463606327772, -0.03134453296661377,
0.03180575743317604, -0.02349487692117691]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.08052562177181244, 0.05988812819123268,
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941,
0.04184781387448311, 0.11904510855674744, 0.010651882737874985,
-0.030094878748059273, -0.004561211448162794]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.0343877375125885, 0.07250142097473145,
0.01443990133702755, -0.03669498860836029, 0.014018685556948185,
0.06307007372379303, 0.03468254581093788, -0.014530746266245842,
-0.05986189469695091, -0.04538322612643242]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.005963834468275309, 0.025043703615665436,
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116,
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823,
-0.03821341320872307, -0.04114910215139389]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.039007965475320816, -0.010609461925923824,
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563,
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938,
-0.014126974157989025, -0.061636749655008316]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.09598278254270554, -0.06301165372133255,
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278,
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741,
-0.04905705526471138, -0.031649429351091385]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.011600406840443611, 0.05651004612445831,
0.016623979434370995, -0.09469003975391388, -0.009865491650998592,
0.07234735041856766, 0.04412448778748512, -0.0411749929189682,
-0.04212445020675659, -0.10263106226921082]}\n",
+ "Embedding shape: 10\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Pass additional arguments that are supported by
`sentence-transformer` models, such as `convert_to_numpy=False`. These
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings`
transform by using the `inference_args` parameter.\n",
Review Comment:
```suggestion
"You can also pass additional arguments that are supported by
`sentence-transformer` models, such as `convert_to_numpy=False`. These
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings`
transform by using the `inference_args` parameter.\n",
```
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Pass additional arguments that are supported by
`sentence-transformer` models, such as `convert_to_numpy=False`. These
arguments are passed as a `dict` to the `SentenceTransformerEmbeddings`
transform by using the `inference_args` parameter.\n",
+ "\n",
+ "By passing `convert_to_numpy=False`, the output will contain
`torch.Tensor`s."
+ ],
+ "metadata": {
+ "id": "1MFom0PW_vRv"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm_with_inference_args =
tempfile.mkdtemp(prefix='huggingface_')\n",
+ "\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'],\n",
+ " inference_args={'convert_to_numpy': False}\n",
+ " )\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",
Review Comment:
Nit: the formatting is a little funky here. This level of indentation should
match the expression right above it.
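For illustration only, here is a rough sketch of one layout where both
parenthesized expressions share the same continuation indent. The 2- and
6-space widths and the line break inside `MLTransform(...)` are assumptions,
since this email does not preserve the notebook's original whitespace. The
helper `pipe_indents` is hypothetical and only checks the alignment.

```python
import ast

# Hypothetical re-indented version of the cell under review. The indent
# widths are an assumption, not the formatting the PR necessarily adopted.
CELL = '''\
with beam.Pipeline() as pipeline:
  data_pcoll = (
      pipeline
      | "CreateData" >> beam.Create(content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(
          write_artifact_location=artifact_location_minilm_with_inference_args
      ).with_transform(embedding_transform))
'''


def pipe_indents(source):
    """Return the indent width of every line that starts with a `|` step."""
    return [
        len(line) - len(line.lstrip())
        for line in source.splitlines()
        if line.lstrip().startswith("|")
    ]


# Parsing only checks syntax; the Beam names are never resolved or executed.
ast.parse(CELL)
print(pipe_indents(CELL))  # [6, 6]
```

Both `|` steps land at the same indent, so the second expression lines up with
the one above it.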