damccorm commented on code in PR #33561:
URL: https://github.com/apache/beam/pull/33561#discussion_r1917393577
##########
website/www/site/content/en/documentation/transforms/python/elementwise/enrichment.md:
##########
@@ -46,6 +46,54 @@ The following examples demonstrate how to create a pipeline that use the enrichm
| Vertex AI Feature Store (Legacy) | [Enrichment with Legacy Vertex AI Feature Store](/documentation/transforms/python/elementwise/enrichment-vertexai/#example-2-enrichment-with-vertex-ai-feature-store-legacy) |
{{< /table >}}
+## BigQuery Support
+
+The enrichment transform supports integration with **BigQuery** to dynamically enrich data using BigQuery datasets. By leveraging BigQuery as an external data source, users can execute efficient lookups for data enrichment directly in their Apache Beam pipelines.
+
+To use BigQuery for enrichment:
+- Configure your BigQuery table as the data source for the enrichment process.
+- Ensure your pipeline has the appropriate credentials and permissions to access the BigQuery dataset.
+- Specify the query to extract the data to be used for enrichment.
+
+This integration is particularly beneficial for use cases that require augmenting real-time streaming data with information stored in BigQuery.
+
+---
+
+## Batching
+
+To optimize requests to external services, the enrichment transform uses batching. Instead of performing a lookup for each individual element, the transform groups multiple elements into a batch and performs a single lookup for the entire batch.
+
+### Advantages of Batching:
+- **Improved Throughput**: Reduces the number of network calls.
+- **Lower Latency**: Fewer round trips to the external service.
+- **Cost Optimization**: Minimizes API call costs when working with paid external services.
+
+Users can configure the batch size by specifying parameters in their pipeline setup. Adjusting the batch size can help fine-tune the balance between throughput and latency.
+
+---
+
+## Caching with `with_redis_cache`
+
+For frequently used enrichment data, caching can significantly improve performance by reducing repeated calls to the remote service. Apache Beam's `with_redis_cache` method allows you to integrate a Redis cache into the enrichment pipeline.
+
+### Benefits of Caching:
+- **Reduced Latency**: Fetches enrichment data from the cache instead of making network calls.
+- **Improved Resilience**: Minimizes the impact of network outages or service downtimes.
+- **Scalability**: Handles large volumes of enrichment requests efficiently.
+
+To enable caching:
+1. Set up a Redis instance accessible by your pipeline.
+2. Use the `with_redis_cache` method to configure the cache in your enrichment transform.
+3. Specify the time-to-live (TTL) for cache entries to ensure data freshness.
+
+Example:
+```python
+from apache_beam.transforms.enrichment import with_redis_cache
+
+# Enrichment pipeline with Redis cache
+enriched_data = (input_data
+    | 'Enrich with Cache' >> with_redis_cache(redis_config=redis_config, enrichment_transform=my_enrichment_transform))
+```
Review Comment:
I don't think this generated code is right; `with_redis_cache` should be appended to the enrichment transform like this -
https://github.com/apache/beam/blob/b5fa8831c0369c6dff345ef69ab3becfdc02b650/sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_it_test.py#L333
##########
examples/notebooks/beam-ml/bigtable_enrichment_transform.ipynb:
##########
@@ -603,6 +603,37 @@
"\n"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### What is a Cross-Join?\n",
+ "A cross-join is a Cartesian product operation where each row from one table is combined with every row from another table. It is useful when we want to create all possible combinations of two datasets.\n",
+ "\n",
+ "**Example:**\n",
+ "- Table A:\n",
+ " | A1 | A2 |\n",
+ " |----|----|\n",
+ " | 1 | X |\n",
+ " | 2 | Y |\n",
+ "\n",
+ "- Table B:\n",
+ " | B1 | B2 |\n",
+ " |----|----|\n",
+ " | 10 | P |\n",
+ " | 20 | Q |\n",
+ "\n",
+ "**Result of Cross-Join:**\n",
+ " | A1 | A2 | B1 | B2 |\n",
+ " |----|----|----|----|\n",
+ " | 1 | X | 10 | P |\n",
+ " | 1 | X | 20 | Q |\n",
+ " | 2 | Y | 10 | P |\n",
+ " | 2 | Y | 20 | Q |\n",
+ "\n",
+ "Cross-joins can be computationally expensive for large datasets, so use them judiciously.\n"
Review Comment:
Could you combine this cell and the next one into a single cell? I think
they're discussing the same thing.
##########
website/www/site/content/en/documentation/transforms/python/elementwise/enrichment.md:
##########
@@ -46,6 +46,54 @@ The following examples demonstrate how to create a pipeline that use the enrichm
| Vertex AI Feature Store (Legacy) | [Enrichment with Legacy Vertex AI Feature Store](/documentation/transforms/python/elementwise/enrichment-vertexai/#example-2-enrichment-with-vertex-ai-feature-store-legacy) |
{{< /table >}}
+## BigQuery Support
+
+The enrichment transform supports integration with **BigQuery** to dynamically enrich data using BigQuery datasets. By leveraging BigQuery as an external data source, users can execute efficient lookups for data enrichment directly in their Apache Beam pipelines.
+
+To use BigQuery for enrichment:
+- Configure your BigQuery table as the data source for the enrichment process.
+- Ensure your pipeline has the appropriate credentials and permissions to access the BigQuery dataset.
+- Specify the query to extract the data to be used for enrichment.
+
+This integration is particularly beneficial for use cases that require augmenting real-time streaming data with information stored in BigQuery.
+
+---
+
+## Batching
+
+To optimize requests to external services, the enrichment transform uses batching. Instead of performing a lookup for each individual element, the transform groups multiple elements into a batch and performs a single lookup for the entire batch.
+
+### Advantages of Batching:
+- **Improved Throughput**: Reduces the number of network calls.
+- **Lower Latency**: Fewer round trips to the external service.
+- **Cost Optimization**: Minimizes API call costs when working with paid external services.
+
+Users can configure the batch size by specifying parameters in their pipeline setup. Adjusting the batch size can help fine-tune the balance between throughput and latency.
+
+---
+
+## Caching with `with_redis_cache`
+
+For frequently used enrichment data, caching can significantly improve performance by reducing repeated calls to the remote service. Apache Beam's `with_redis_cache` method allows you to integrate a Redis cache into the enrichment pipeline.
Review Comment:
Could you link to
https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.enrichment.html#apache_beam.transforms.enrichment.Enrichment.with_redis_cache
here?
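One possible wording with the link (a sketch; adjust to taste):
```suggestion
+For frequently used enrichment data, caching can significantly improve performance by reducing repeated calls to the remote service. Apache Beam's [`with_redis_cache`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.enrichment.html#apache_beam.transforms.enrichment.Enrichment.with_redis_cache) method allows you to integrate a Redis cache into the enrichment pipeline.
```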
##########
examples/notebooks/beam-ml/bigtable_enrichment_transform.ipynb:
##########
@@ -603,6 +603,37 @@
"\n"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### What is a Cross-Join?\n",
+ "A cross-join is a Cartesian product operation where each row from one table is combined with every row from another table. It is useful when we want to create all possible combinations of two datasets.\n",
+ "\n",
+ "**Example:**\n",
+ "- Table A:\n",
Review Comment:
Could you remove the `-`? It throws off the formatting here and below.
```suggestion
"Table A:\n",
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]