AnandInguva commented on code in PR #29893:
URL: https://github.com/apache/beam/pull/29893#discussion_r1440673432
##########
examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb:
##########
@@ -0,0 +1,456 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Generate Text Embeddings by using Hugging Face Hub models\n",
+ "\n",
+ "<table align=\"left\">\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\"
/>Run in Google Colab</a>\n",
+ " </td>\n",
+ " <td>\n",
+ " <a target=\"_blank\"
href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb\"><img
src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\"
/>View source on GitHub</a>\n",
+ " </td>\n",
+ "</table>\n"
+ ],
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Text Embeddings\n",
+ "\n",
+ "Text embeddings are a way of representing text as numerical vectors.
This allows computers to understand and process text data, which is essential
for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "### Uses of text embeddings\n",
+ "By converting text into numerical vectors, text embeddings make it
possible for computers to process and analyze text data. This enables a wide
range of NLP tasks, including:\n",
+ "\n",
+ "* Semantic search: Finding documents or passages that are relevant to
a query, even if the query doesn't use the exact same words as the
documents.\n",
+ "* Text classification: Categorizing text data into different classes,
such as spam or not spam, or positive sentiment or negative sentiment.\n",
+ "* Machine translation: Translating text from one language to another
while preserving the meaning.\n",
+ "* Text summarization: Creating shorter summaries of longer pieces of
text.\n",
+ "\n",
+ "In this notebook, we will use Apache Beam's `MLTransform` to
generate embeddings for the text data.\n",
+ "\n",
+ "Hugging Face's
[`SentenceTransformers`](https://huggingface.co/sentence-transformers)
framework uses Python to generate sentence, text, and image embeddings.\n",
+ "\n",
+ "To generate text embeddings that use Hugging Face models and
`MLTransform`, use `SentenceTransformerEmbeddings` to specify the model
configuration.\n",
+ "\n",
+ "To use `SentenceTransformerEmbeddings`, first install the
`sentence-transformers` package."
+ ],
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Install dependencies\n",
+ "\n",
+ "Install Apache Beam and the dependencies needed to work with Hugging
Face embeddings."
+ ],
+ "metadata": {
+ "id": "jqYXaBJ821Zs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "! git clone https://github.com/apache/beam.git\n",
+ "! cd beam/sdks/python\n",
+ "! pip install beam/sdks/python\n",
+ "! pip install sentence-transformers"
+ ],
+ "metadata": {
+ "id": "shzCUrZI1XhF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import tempfile\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.embeddings.huggingface import
SentenceTransformerEmbeddings"
+ ],
+ "metadata": {
+ "id": "jVxSi2jS3M3b"
+ },
+ "execution_count": 24,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their
attributes to an artifact location. These transforms are used when you run
`MLTransform` in `read` mode.\n",
+ "\n",
+ "For more information about using `MLTransform`, see [Preprocess data
with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in
the Apache Beam documentation."
+ ],
+ "metadata": {
+ "id": "kXDM8C7d3nPV"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To generate text embeddings with `MLTransform`, the following
pipeline uses the model `sentence-transformers/all-MiniLM-L6-v2` and the text
inputs from the Hugging Face blog [Getting Started With
Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)."
+ ],
+ "metadata": {
+ "id": "Dbkmu3HP6Kql"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "content = [\n",
+ " {'x': 'How do I get a replacement Medicare card?'},\n",
+ " {'x': 'What is the monthly premium for Medicare Part B?'},\n",
+ " {'x': 'How do I terminate my Medicare Part B (medical
insurance)?'},\n",
+ " {'x': 'How do I sign up for Medicare?'},\n",
+ " {'x': 'Can I sign up for Medicare Part B if I am working and have
health insurance through an employer?'},\n",
+ " {'x': 'How do I sign up for Medicare Part B if I already have
Part A?'},\n",
+ " {'x': 'What are Medicare late enrollment penalties?'},\n",
+ " {'x': 'What is Medicare and who can get it?'},\n",
+ " {'x': 'How can I get help with my Medicare Part A and Part B
premiums?'},\n",
+ " {'x': 'What are the different parts of Medicare?'},\n",
+ " {'x': 'Will my Medicare premiums be higher because of my higher
income?'},\n",
+ " {'x': 'What is TRICARE ?'},\n",
+ " {'x': \"Should I sign up for Medicare Part B if I have Veterans'
Benefits?\"}\n",
+ "]\n",
+ "\n",
+ "\n",
+ "# Helper function that truncates each embedding in the dict to its\n",
+ "# first 10 elements so that the printed output stays short.\n",
+ "def truncate_embeddings(d):\n",
+ " for key in d.keys():\n",
+ " d[key] = d[key][:10]\n",
+ " return d"
+ ],
+ "metadata": {
+ "id": "LCTUs8F73iDg"
+ },
+ "execution_count": 25,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm = tempfile.mkdtemp(prefix='huggingface_')\n",
+ "text_embedding_model_name =
'sentence-transformers/all-MiniLM-L6-v2'\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'])\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm).with_transform(embedding_transform))\n",
+ "\n",
+ " transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >>
beam.Map(print)\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x:
print(f\"Embedding shape: {len(x['x'])}\"))"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "SF6izkN134sf",
+ "outputId": "740f450a-dc9c-4c9d-f4fb-8ef27cca3d74"
+ },
+ "execution_count": 26,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "{'x': [-0.023889463394880295, 0.05525851249694824,
-0.011654896661639214, -0.03341428190469742, -0.012260555289685726,
-0.024872763082385063, -0.01266342680901289, 0.025345895439386368,
0.01850851997733116, -0.08350814878940582]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.01268761046230793, 0.04687413573265076,
-0.010502150282263756, -0.020383981987833977, -0.01336114201694727,
0.04232167452573776, 0.016627851873636246, -0.004099288955330849,
-0.0026070312596857548, -0.010187783278524876]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [0.0004943296662531793, 0.11941202729940414,
0.005229473114013672, -0.09273427724838257, 0.007772865705192089,
-0.005324989557266235, 0.03450643643736839, -0.05198145657777786,
-0.006264965515583754, -0.006110507529228926]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.029711326584219933, 0.02329839952290058,
-0.05704096704721451, -0.01218305341899395, -0.013710316270589828,
0.02979600988328457, 0.0637386366724968, 0.0011010386515408754,
-0.04512352868914604, -0.040747467428445816]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.02562842145562172, 0.070388562977314,
-0.017379559576511383, -0.0565667562186718, 0.02857644483447075,
0.052822552621364594, 0.06706249713897705, -0.05261750519275665,
-0.054702047258615494, -0.11623040586709976]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.022656124085187912, 0.021159743890166283,
0.0051048519089818, -0.04649421200156212, 0.009073587134480476,
0.04149482399225235, 0.0542682446539402, -0.02418488636612892,
-0.013482789508998394, -0.07596635073423386]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.0029113641940057278, 0.060791268944740295,
-0.009175681509077549, -0.006133317016065121, 0.04049248993396759,
0.036593958735466, 0.002054463606327772, -0.03134453296661377,
0.03180575743317604, -0.02349487692117691]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.08052562177181244, 0.05988812819123268,
-0.048846807330846786, -0.040176115930080414, -0.06334187835454941,
0.04184781387448311, 0.11904510855674744, 0.010651882737874985,
-0.030094878748059273, -0.004561211448162794]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.0343877375125885, 0.07250142097473145,
0.01443990133702755, -0.03669498860836029, 0.014018685556948185,
0.06307007372379303, 0.03468254581093788, -0.014530746266245842,
-0.05986189469695091, -0.04538322612643242]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.005963834468275309, 0.025043703615665436,
-0.003182061715051532, -0.025242920964956284, -0.0398230254650116,
-0.012771873734891415, 0.0447133406996727, 0.014535333029925823,
-0.03821341320872307, -0.04114910215139389]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.039007965475320816, -0.010609461925923824,
-0.007382705342024565, -0.050189778208732605, -0.0025175788905471563,
-0.0416409894824028, 0.02696940489113331, -0.014800631441175938,
-0.014126974157989025, -0.061636749655008316]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.09598278254270554, -0.06301165372133255,
-0.11690578609704971, -0.05907457321882248, -0.05132286250591278,
-0.0034391973167657852, 0.018687350675463676, 0.006543711293488741,
-0.04905705526471138, -0.031649429351091385]}\n",
+ "Embedding shape: 10\n",
+ "{'x': [-0.011600406840443611, 0.05651004612445831,
0.016623979434370995, -0.09469003975391388, -0.009865491650998592,
0.07234735041856766, 0.04412448778748512, -0.0411749929189682,
-0.04212445020675659, -0.10263106226921082]}\n",
+ "Embedding shape: 10\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "You can pass additional arguments that are supported by
`sentence-transformers` models, such as `convert_to_numpy=False`. Pass these
arguments as a `dict` to the `SentenceTransformerEmbeddings`
transform by using the `inference_args` parameter.\n",
+ "\n",
+ "When you pass `convert_to_numpy=False`, the output contains
`torch.Tensor` objects."
+ ],
+ "metadata": {
+ "id": "1MFom0PW_vRv"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "artifact_location_minilm_with_inference_args =
tempfile.mkdtemp(prefix='huggingface_')\n",
+ "\n",
+ "embedding_transform = SentenceTransformerEmbeddings(\n",
+ " model_name=text_embedding_model_name, columns=['x'],\n",
+ " inference_args={'convert_to_numpy': False}\n",
+ " )\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >>
MLTransform(write_artifact_location=artifact_location_minilm_with_inference_args).with_transform(embedding_transform))\n",
+ "\n",
+ " # The outputs are PyTorch tensors.\n",
+ " transformed_pcoll | 'LogOutput' >> beam.Map(lambda x:
print(type(x['x'])))\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x:
print(f\"Embedding shape: {len(x['x'])}\"))\n"
+ ],
+ "metadata": {
+ "id": "xyezKuzY_uLD",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "d09a07d5-55dc-4544-ea75-39b8105a3e5b"
+ },
+ "execution_count": 27,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n",
+ "Embedding shape: 384\n",
+ "<class 'torch.Tensor'>\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Use the model `sentence-transformers/sentence-t5-large` to generate
text embeddings. The model uses only the encoder from a `T5-large` model. The
weights are stored in FP16. For more information about the model, see
[Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text
Models](https://arxiv.org/abs/2108.08877)."
Review Comment:
No, it just uses a bigger model. The snippet above uses a smaller model. I
think I could remove the snippet that uses the smaller model.
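
A side note in case the smaller-model snippet stays: `truncate_embeddings` modifies each element in place, which likely explains why that cell prints `Embedding shape: 10` while the later cell prints 384. Beam also asks that transforms not mutate their input elements. Below is a minimal sketch of a non-mutating variant in plain Python, without Beam. The `n` parameter and the sample `element` dict are made up for illustration and are not part of the notebook.

```python
def truncate_embeddings(d, n=10):
    """Return a new dict whose values are cut to their first n items."""
    # Build a fresh dict instead of assigning into d, so the input
    # element stays unchanged for any other consumer of the PCollection.
    return {key: value[:n] for key, value in d.items()}


# Hypothetical element standing in for one MLTransform output row.
element = {'x': [0.0] * 384}
preview = truncate_embeddings(element)

print(len(preview['x']))  # length of the truncated copy
print(len(element['x']))  # length of the untouched original
```

With this variant, the `PrintEmbeddingShape` step would be expected to report the full embedding length, because the truncation no longer touches the shared element.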