Re: [PR] Add notebooks for text embeddings [beam]

via GitHub Wed, 03 Jan 2024 08:51:39 -0800


AnandInguva commented on code in PR #29893:
URL: https://github.com/apache/beam/pull/29893#discussion_r1440673432



##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "UmEFwsNs1OES"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Generate Text Embeddings by using Hugging Face Hub models\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "ZUSiAR62SgO8"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Text Embeddings\n",
+        "\n",
+        "Text embeddings are a way of representing text as numerical vectors. 
This allows computers to understand and process text data, which is essential 
for many natural language processing (NLP) tasks.\n",
+        "\n",
+        "### Uses of text embeddings\n",
+        "By converting text into numerical vectors, text embeddings make it 
possible for computers to process and analyze text data. This enables a wide 
range of NLP tasks, including:\n",
+        "\n",
+        "* Semantic search: Finding documents or passages that are relevant to 
a query, even if the query doesn't use the exact same words as the 
documents.\n",
+        "* Text classification: Categorizing text data into different classes, such as spam or not spam, or positive sentiment or negative sentiment.\n",
+        "* Machine translation: Translating text from one language to another 
while preserving the meaning.\n",
+        "* Text summarization: Creating shorter summaries of longer pieces of 
text.\n",
+        "\n",
+        "In this notebook, we will use Apache Beam's `MLTransform` to generate embeddings from text data.\n",
+        "\n",
+        "Hugging Face's 
[`SentenceTransformers`](https://huggingface.co/sentence-transformers) 
framework uses Python to generate sentence, text, and image embeddings.\n",
+        "\n",
+        "To generate text embeddings that use Hugging Face models and 
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model 
configuration.\n",
+        "\n",
+        "To use `SentenceTransformerEmbeddings`, first install the `sentence-transformers` package."
+      ],
+      "metadata": {
+        "id": "yvVIEhF01ZWq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        " Install Apache Beam and the dependencies needed to work with Hugging 
Face embeddings."
+      ],
+      "metadata": {
+        "id": "jqYXaBJ821Zs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! git clone https://github.com/apache/beam.git\n",
+        "! cd beam/sdks/python\n",
+        "! pip install beam/sdks/python\n",
+        "! pip install sentence-transformers"
+      ],
+      "metadata": {
+        "id": "shzCUrZI1XhF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tempfile\n",
+        "import apache_beam as beam\n",
+        "from apache_beam.ml.transforms.base import MLTransform\n",
+        "from apache_beam.ml.transforms.embeddings.huggingface import 
SentenceTransformerEmbeddings"
+      ],
+      "metadata": {
+        "id": "jVxSi2jS3M3b"
+      },
+      "execution_count": 24,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Use MLTransform in write mode\n",
+        "\n",
+        "In `write` mode, `MLTransform` saves the transforms and their 
attributes to an artifact location. These transforms are used when you run 
`MLTransform` in `read` mode.\n",
+        "\n",
+        "For more information about using `MLTransform`, see [Preprocess data 
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in 
the Apache Beam documentation."
+      ],
+      "metadata": {
+        "id": "kXDM8C7d3nPV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To generate text embeddings with `MLTransform`, the following 
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text 
inputs from the Hugging Face blog [Getting Started With 
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+      ],
+      "metadata": {
+        "id": "Dbkmu3HP6Kql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "content = [\n",
+        "    {'x': 'How do I get a replacement Medicare card?'},\n",
+        "    {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+        "    {'x': 'How do I terminate my Medicare Part B (medical 
insurance)?'},\n",
+        "    {'x': 'How do I sign up for Medicare?'},\n",
+        "    {'x': 'Can I sign up for Medicare Part B if I am working and have 
health insurance through an employer?'},\n",
+        "    {'x': 'How do I sign up for Medicare Part B if I already have 
Part A?'},\n",
+        "    {'x': 'What are Medicare late enrollment penalties?'},\n",
+        "    {'x': 'What is Medicare and who can get it?'},\n",
+        "    {'x': 'How can I get help with my Medicare Part A and Part B 
premiums?'},\n",
+        "    {'x': 'What are the different parts of Medicare?'},\n",
+        "    {'x': 'Will my Medicare premiums be higher because of my higher 
income?'},\n",
+        "    {'x': 'What is TRICARE ?'},\n",
+        "    {'x': \"Should I sign up for Medicare Part B if I have Veterans' 
Benefits?\"}\n",
+        "]\n",
+        "\n",
+        "\n",
+        "# Helper function that returns a dict containing only the first\n",
+        "# 10 elements of each generated embedding.\n",
+        "def truncate_embeddings(d):\n",
+        "  for key in d.keys():\n",
+        "    d[key] = d[key][:10]\n",
+        "  return d"
+      ],
+      "metadata": {
+        "id": "LCTUs8F73iDg"
+      },
+      "execution_count": 25,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+        "text_embedding_model_name = 
'sentence-transformers/all-MiniLM-L6-v2'\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'])\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+        "\n",
+        "  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> 
beam.Map(print)\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "SF6izkN134sf",
+        "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'x': [-0.023889463394880295, 0.05525851249694824, 
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726, 
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368, 
0.01850851997733116, -0.08350814878940582]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.01268761046230793, 0.04687413573265076, 
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727, 
0.04232167452573776, 0.016627851873636246, -0.004099288955330849, 
-0.0026070312596857548, -0.010187783278524876]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [0.0004943296662531793, 0.11941202729940414, 
0.005229473114013672, -0.09273427724838257, 0.007772865705192089, 
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786, 
-0.006264965515583754, -0.006110507529228926]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.029711326584219933, 0.02329839952290058, 
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828, 
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754, 
-0.04512352868914604, -0.040747467428445816]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.02562842145562172, 0.070388562977314, 
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075, 
0.052822552621364594, 0.06706249713897705, -0.05261750519275665, 
-0.054702047258615494, -0.11623040586709976]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.022656124085187912, 0.021159743890166283, 
0.0051048519089818, -0.04649421200156212, 0.009073587134480476, 
0.04149482399225235, 0.0542682446539402, -0.02418488636612892, 
-0.013482789508998394, -0.07596635073423386]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0029113641940057278, 0.060791268944740295, 
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759, 
0.036593958735466, 0.002054463606327772, -0.03134453296661377, 
0.03180575743317604, -0.02349487692117691]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.08052562177181244, 0.05988812819123268, 
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941, 
0.04184781387448311, 0.11904510855674744, 0.010651882737874985, 
-0.030094878748059273, -0.004561211448162794]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.0343877375125885, 0.07250142097473145, 
0.01443990133702755, -0.03669498860836029, 0.014018685556948185, 
0.06307007372379303, 0.03468254581093788, -0.014530746266245842, 
-0.05986189469695091, -0.04538322612643242]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.005963834468275309, 0.025043703615665436, 
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116, 
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823, 
-0.03821341320872307, -0.04114910215139389]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.039007965475320816, -0.010609461925923824, 
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563, 
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938, 
-0.014126974157989025, -0.061636749655008316]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.09598278254270554, -0.06301165372133255, 
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278, 
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741, 
-0.04905705526471138, -0.031649429351091385]}\n",
+            "Embedding shape: 10\n",
+            "{'x': [-0.011600406840443611, 0.05651004612445831, 
0.016623979434370995, -0.09469003975391388, -0.009865491650998592, 
0.07234735041856766, 0.04412448778748512, -0.0411749929189682, 
-0.04212445020675659, -0.10263106226921082]}\n",
+            "Embedding shape: 10\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Pass additional arguments that are supported by `sentence-transformers` models, such as `convert_to_numpy=False`. These arguments are passed as a `dict` to the `SentenceTransformerEmbeddings` transform by using the `inference_args` parameter.\n",
+        "\n",
+        "By passing `convert_to_numpy=False`, the output will contain 
`torch.Tensor`s."
+      ],
+      "metadata": {
+        "id": "1MFom0PW_vRv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "artifact_location_minilm_with_inference_args = 
tempfile.mkdtemp(prefix='huggingface_')\n",
+        "\n",
+        "embedding_transform = SentenceTransformerEmbeddings(\n",
+        "        model_name=text_embedding_model_name, columns=['x'],\n",
+        "        inference_args={'convert_to_numpy': False}\n",
+        "        )\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  data_pcoll = (\n",
+        "          pipeline\n",
+        "          | \"CreateData\" >> beam.Create(content))\n",
+        "  transformed_pcoll = (\n",
+        "      data_pcoll\n",
+        "      | \"MLTransform\" >> 
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",
+        "\n",
+        "  # The outputs are PyTorch tensors.\n",
+        "  transformed_pcoll | 'LogOutput' >> beam.Map(lambda x: 
print(type(x['x'])))\n",
+        "\n",
+        "  transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: 
print(f\"Embedding shape: {len(x['x'])}\"))\n"
+      ],
+      "metadata": {
+        "id": "xyezKuzY_uLD",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "d09a07d5-55dc-4544-ea75-39b8105a3e5b"
+      },
+      "execution_count": 27,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n",
+            "Embedding shape: 384\n",
+            "<class 'torch.Tensor'>\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Use the model `sentence-transformers/sentence-t5-large` to generate text embeddings. The model uses only the encoder from a `T5-large` model. The weights are stored in FP16. For more information about the model, see [Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models](https://arxiv.org/abs/2108.08877)."

Review Comment:
   No, it just uses a bigger model. The snippet above uses a smaller model. I 
think I could remove the snippet that uses the smaller model.
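
   A side note on the `truncate_embeddings` helper in the diff above: it slices each value in place. That likely explains why the recorded output of the first pipeline shows `Embedding shape: 10`, while the second pipeline reports 384 for the same `all-MiniLM-L6-v2` model. If the smaller-model snippet is kept, a non-mutating variant may be safer. This is only a minimal sketch in plain Python, assuming the values are sliceable sequences; the name `truncated_copy` and the stand-in element are hypothetical and not part of the notebook:

```python
def truncated_copy(element, n=10):
    """Return a new dict whose values are cut to their first n items.

    Hypothetical replacement for the notebook's truncate_embeddings helper;
    the input dict is not modified.
    """
    return {key: value[:n] for key, value in element.items()}


# Stand-in for one pipeline element; real embeddings would come from MLTransform.
element = {'x': [0.1] * 384}
preview = truncated_copy(element)

print(len(preview['x']))  # length of the truncated preview
print(len(element['x']))  # length of the original, which is left untouched
```

   In the pipeline, this would be used as `beam.Map(truncated_copy)` on the logging branch, so the branch that prints the embedding shape still sees the full vector.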



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
