This is an automated email from the ASF dual-hosted git repository.
anandinguva pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 75cfbee1591 Update MLTransform docs (#29910)
75cfbee1591 is described below
commit 75cfbee1591b99ff02d0a6a19631199e719b44fa
Author: Anand Inguva <[email protected]>
AuthorDate: Mon Jan 8 17:10:00 2024 +0000
Update MLTransform docs (#29910)
* Update MLTransform docs
* Update MLTransform docs
* Apply suggestions from code review
Co-authored-by: Rebecca Szper <[email protected]>
* Update preprocess-data.md
* Apply suggestions from code review
Co-authored-by: Rebecca Szper <[email protected]>
* Move embeddings content higher up
* Fix website checks
* Revert "Fix website checks"
This reverts commit 229f7eddf4ce8b2847ab248c0fadb20fa4f898d1.
* Fix website checks
---------
Co-authored-by: Rebecca Szper <[email protected]>
---
.../content/en/documentation/ml/preprocess-data.md | 75 +++++++---------------
1 file changed, 24 insertions(+), 51 deletions(-)
diff --git a/website/www/site/content/en/documentation/ml/preprocess-data.md b/website/www/site/content/en/documentation/ml/preprocess-data.md
index 2b291b9c75a..1365926d3cc 100644
--- a/website/www/site/content/en/documentation/ml/preprocess-data.md
+++ b/website/www/site/content/en/documentation/ml/preprocess-data.md
@@ -23,16 +23,11 @@ preprocessing data for training and inference. The
`MLTransform` class wraps the
various transforms in one class, simplifying your workflow. For a full list of
available transforms, see the [Transforms](#transforms) section on this page.
-The set of transforms currently available in the `MLTransform` class come from
-the TensorFlow Transforms (TFT) library. TFT offers specialized processing
-modules for machine learning tasks.
-
## Why use MLTransform {#use-mltransform}
- With `MLTransform`, you can use the same preprocessing steps for both
training and inference, which ensures consistent results.
-- Use `MLTransform` to transform a single example or a batch of
- examples.
+- Generate [embeddings](https://en.wikipedia.org/wiki/Embedding) on text data using large language models (LLMs).
- `MLTransform` can do a full pass on the dataset, which is useful when
you need to transform a single element only after analyzing the entire
dataset. For example, with `MLTransform`, you can complete the following
tasks:
@@ -45,18 +40,33 @@ modules for machine learning tasks.
- Count the occurrences of words in all the documents to calculate
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
weights.
-
## Support and limitations {#support}
-- Available in the Apache Beam Python SDK versions 2.50.0 and later.
-- Supports Python 3.8 and 3.9.
+- Available in the Apache Beam Python SDK versions 2.53.0 and later.
+- Supports Python 3.8, 3.9, and 3.10.
- Only available for pipelines that use [default windows](/documentation/programming-guide/#single-global-window).
-- Only supports one-to-one transform mapping on a single element.
## Transforms {#transforms}
-You can use `MLTransform` to perform the following data processing transforms.
-For information about the transforms, see
+You can use `MLTransform` to generate text embeddings and to perform various data processing transforms.
+
+### Text embedding transforms
+
+You can use `MLTransform` to generate embeddings that you can use to push data into vector databases or to run inference.
+
+{{< table >}}
+| Transform name | Description |
+| ------- | ---------------|
+| SentenceTransformerEmbeddings | Uses the Hugging Face [`sentence-transformers`](https://huggingface.co/sentence-transformers) models to generate text embeddings. |
+| VertexAITextEmbeddings | Uses models from the [Vertex AI text-embeddings API](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) to generate text embeddings. |
+{{< /table >}}
+
+
+### Data processing transforms that use TFT
+
+The following set of transforms available in the `MLTransform` class come from
+the TensorFlow Transforms (TFT) library. TFT offers specialized processing
+modules for machine learning tasks. For information about these transforms, see
[Module:tft](https://www.tensorflow.org/tfx/transform/api_docs/python/tft) in the
TensorFlow documentation.
@@ -73,18 +83,10 @@ TensorFlow documentation.
| TFIDF | See [`tft.tfidf`](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/tfidf) in the TensorFlow documentation. |:
{{< /table >}}
-Apply the transforms on either single or multiple columns passed as a
-`dict` on structured data. Keys are column names and values are lists containing
-each column's data.
-
## I/O requirements {#io}
-- Input to the `MLTransform` class must be in one of the following formats:
- - A `dict` of `str`
- - Primitive types
- - List of primitive types
- - NumPy arrays
-- `MLTransform` outputs a Beam `Row` object with NumPy arrays.
+- Input to the `MLTransform` class must be a dictionary.
+- `MLTransform` outputs a Beam `Row` object with transformed elements.
- The output `PCollection` is a schema `PCollection`. The output schema
contains the transformed columns.
@@ -197,32 +199,3 @@ Replace the following values:
For more examples, see
[MLTransform for data processing](/documentation/transforms/python/elementwise/mltransform)
in the [transform catalog](/documentation/transforms/python/overview/).
-
-### ScaleTo01 example {#scaleto01}
-
-This example demonstrates how to use `MLTransform` to normalize your data
-between 0 and 1 by using the minimum and maximum values from your entire
-dataset. `MLTransform` uses the `ScaleTo01` transformation.
-
-Use the following snippet to apply `ScaleTo01` on column `x` of the input
-data.
-
-```
-data_pcoll | MLTransform(write_artifact_location=<LOCATION>).with_transform(ScaleTo01(columns=['x']))
-```
-
-The `ScaleTo01` transformation produces two artifacts: the `min` and the `max`
-of the entire dataset. For more information, see the
-[Artifacts](#artifacts) section on this page.
-
-## Metrics {#metrics}
-
-When you use MLTransform, the following metrics are available.
-
-{{< table >}}
-| Metric | Description |
-| ------- | ---------------|
-| Data throughput | The number of records processed per second. This metric indicates the processing capacity of the pipeline for `beam.MLTransform.` |
-| Memory usage | The number of records processed per second. This metric indicates the processing capacity of the pipeline for `beam.MLTransform`. |
-| Counters | Tracks the number of elements processed. Each `MLTransform` has a counter. |:
-{{< /table >}}
\ No newline at end of file