dariabezkorovaina commented on code in PR #30182: URL: https://github.com/apache/beam/pull/30182#discussion_r1474980594
########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. Review Comment: ```suggestion In Apache Beam, a [batched ‘DoFn’](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a specialized type of [‘DoFn’](https://beam.apache.org/documentation/programming-guide/#pardo) that operates on batches of logical elements within a [‘PCollection’](https://beam.apache.org/documentation/basics/#pcollection). Unlike regular ‘DoFn’s, which process each element individually, batched ‘DoFn’s enable the efficient processing of multiple elements together as batches. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. 
Review Comment: ```suggestion Batched `DoFns` are currently only available in the Apache Beam Python SDK. They leverage powerful Python vectorized libraries, such as [NumPy](https://numpy.org/), [SciPy](https://scipy.org/), and [pandas](https://pandas.pydata.org/), which are designed for analysis and computing at scale. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. Review Comment: ```suggestion ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? 
+Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. 
+ +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. Review Comment: ```suggestion For more details, including advanced usage and code samples, you can refer to the [Apache Beam documentation on batched ‘DoFn’s](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types). ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` Review Comment: Suggested edits to explain what is a batched DoFn first, before jumping to "mixed" DoFns ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. 
+ +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + Review Comment: In the documentation, they do not say it is possible for all DoFns, they say "for some" ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. Review Comment: This part in the documentation was provided AFTER an example of an implicit batch explosion (below). 
Suggest adding an example similar to the one below first, before the "passed through" example, so that it is easier to understand what a 'chained' DoFn is in contrast with a 'non-chained' DoFn:
```python
(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64)
   | beam.ParDo(MultiplyByTwo())  # Implicit buffering and batch creation
   | beam.Map(lambda x: x/3))     # Implicit batch explosion
```
########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? Review Comment: ```suggestion What is a batched 'DoFn' in Apache Beam? ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. Review Comment: ```suggestion In this example, the ‘np.int64’ type represents the individual element. The 'process' method multiplies an element by ten, yielding a single element. The ‘np.ndarray’ type represents the batch.
The 'process_batch' method multiplies each element in the batch by ten, yielding a single batch. During pipeline execution, Apache Beam will automatically select the best implementation based on the context. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. Review Comment: ```suggestion By default, Apache Beam implicitly buffers elements and creates batches on the input side, then explodes batches back into individual elements on the output side. 
However, if batched 'DoFn's with equivalent types are chained together, this batch creation and explosion process is skipped, and the batches are passed through for more efficient processing. Here’s an example with chained ‘DoFn’s of equivalent types: ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + Review Comment: ```suggestion For some ‘DoFn’s, you can provide both a batched and an element-wise implementation of your desired logic by defining both ‘process’ and ‘process_batch’ methods. Here is a simple example of a flexible ‘DoFn’ with both batched and element-wise implementations: ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. 
These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + Review Comment: ```suggestion Batched 'DoFn's are particularly useful when dealing with large amounts of data, allowing for parallel and optimized processing. Additionally, you can chain multiple batched ‘DoFn’s together to create a more efficient data processing pipeline. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` Review Comment: ```suggestion To implement a batched ‘DoFn’, you define a [‘process_batch’](https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/transforms/core.html#DoFn.process_batch) method instead of the typical [‘process’](https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/transforms/core.html#DoFn.process) method used in regular ‘DoFn’s. The ‘process_batch’ method takes a batch of elements as input and produces a batch of elements as an output. 
It is important to note that batches must have a single set of timing properties (event time, window, etc.) for every logical element within the batch; batches cannot span multiple timestamps. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. 
+ +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. Review Comment: ```suggestion The ‘process_batch’ method is expected to produce batched outputs, while the ‘process’ method should produce individual elements. You can customize this production logic with the ‘@beam.DoFn.yields_elements’ and ‘@beam.DoFn.yields_batches’ decorators. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. 
+ +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + Review Comment: ```suggestion In this example, the ‘PTransform.with_output_types’ method sets the element-wise typehint for the output. Thus, when the `MultiplyByTen` class is applied to a `PCollection`, Apache Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? Review Comment: Although users will not be so picky, suggested edits for consistency with the rest of the file :) ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. 
+ +Batched `DoFns` are only available in the Apache Beam Python SDK. + Review Comment: Suggested because it is important to explain when to use batched DoFns before jumping to implementation ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. 
+ +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. + +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. + +Batches must have a single set of timing properties(event time, window) for every logical element in the batch and it's not possible to create batches that span multiple timestamps. Review Comment: ```suggestion ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. 
+ +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. + +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. + Review Comment: ```suggestion ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. Review Comment: Suggested edits to explain what batched DoFns do in general first, then in the next paragraph we'll say that currently, they are only available in Python (and why). 
########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. 
This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. + +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. + +Batches must have a single set of timing properties(event time, window) for every logical element in the batch and it's not possible to create batches that span multiple timestamps. Review Comment: I moved it to the earlier section with explanation of what's batched DoFn ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. Review Comment: Moved this part to later, when the example is provided. -- This is an automated message from the Apache Git Service. 
