dariabezkorovaina commented on code in PR #30182: URL: https://github.com/apache/beam/pull/30182#discussion_r1474980594
########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. Review Comment: ```suggestion In Apache Beam, a [batched ‘DoFn’](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a specialized type of [‘DoFn’](https://beam.apache.org/documentation/programming-guide/#pardo) that operates on batches of logical elements within a [‘PCollection’](https://beam.apache.org/documentation/basics/#pcollection). Unlike regular ‘DoFn’s, which process each element individually, batched ‘DoFn’s enable the efficient processing of multiple elements together as batches. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. 
Review Comment: ```suggestion Batched `DoFns` are currently only available in the Apache Beam Python SDK. They leverage powerful Python vectorized libraries, such as [NumPy](https://numpy.org/), [SciPy](https://scipy.org/), and [pandas](https://pandas.pydata.org/), which are designed for analysis and computing at scale. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. Review Comment: ```suggestion ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? 
+Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. 
+ +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. Review Comment: ```suggestion For more details, including advanced usage and code samples, you can refer to the [Apache Beam documentation on batched ‘DoFn’s](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types). ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` Review Comment: Suggested edits to explain what is a batched DoFn first, before jumping to "mixed" DoFns ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. 
+ +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + Review Comment: In the documentation, they do not say it is possible for all DoFns, they say "for some" ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. Review Comment: This part in the documentation was provided AFTER an example of an implicit batch explosion (below). 
Suggest adding an example similar to the one below first, before the "passed through" example, so that it is easier to understand what a 'chained' DoFn is in contrast with a 'non-chained' DoFn:
```python
(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64)
   | beam.ParDo(MultiplyByTwo())  # Implicit buffering and batch creation
   | beam.Map(lambda x: x/3))     # Implicit batch explosion
```
########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? Review Comment: ```suggestion What is a batched 'DoFn' in Apache Beam? ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. Review Comment: ```suggestion In this example, the ‘np.int64’ type represents the individual element. The 'process' method multiplies an element by ten, yielding a single element. The ‘np.ndarray’ type represents the batch.
The 'process_batch' method multiplies each element in the batch by ten, yielding a single batch. During pipeline execution, Apache Beam will automatically select the best implementation based on the context. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. Review Comment: ```suggestion By default, Apache Beam implicitly buffers elements and creates batches on the input side, then explodes batches back into individual elements on the output side. 
However, if batched 'DoFn's with equivalent types are chained together, this batch creation and explosion process is skipped, and the batches are passed through for more efficient processing. Here’s an example with chained ‘DoFn’s of equivalent types: ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + Review Comment: ```suggestion For some ‘DoFn’s, you can provide both a batched and an element-wise implementation of your desired logic by defining both ‘process’ and ‘process_batch’ methods. Here is a simple example of a flexible ‘DoFn’ with both batched and element-wise implementations: ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. 
These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + Review Comment: ```suggestion Batched 'DoFn's are particularly useful when dealing with large amounts of data, allowing for parallel and optimized processing. Additionally, you can chain multiple batched ‘DoFn’s together to create a more efficient data processing pipeline. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` Review Comment: ```suggestion To implement a batched ‘DoFn’, you define a [‘process_batch’](https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/transforms/core.html#DoFn.process_batch) method instead of the typical [‘process’](https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/transforms/core.html#DoFn.process) method used in regular ‘DoFn’s. The ‘process_batch’ method takes a batch of elements as input and produces a batch of elements as an output. 
It is important to note that batches must have a single set of timing properties (event time, window, etc.) for every logical element within the batch; batches cannot span multiple timestamps. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. 
+ +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. Review Comment: ```suggestion The ‘process_batch’ method is expected to produce batched outputs, while the ‘process’ method should produce individual elements. You can customize this production logic with the ‘@beam.DoFn.yields_elements’ and ‘@beam.DoFn.yields_batches’ decorators. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. 
+ +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + Review Comment: ```suggestion In this example, the ‘PTransform.with_output_types’ method sets the element-wise typehint for the output. Thus, when the `MultiplyByTen` class is applied to a `PCollection`, Apache Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? Review Comment: Although users will not be so picky, suggested edits for consistency with the rest of the file :) ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. 
+ +Batched `DoFns` are only available in the Apache Beam Python SDK. + Review Comment: Suggested because it is important to explain when to use batched DoFns before jumping to implementation ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. 
+ +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. + +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. + +Batches must have a single set of timing properties(event time, window) for every logical element in the batch and it's not possible to create batches that span multiple timestamps. Review Comment: ```suggestion ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. 
+ +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. + +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. + Review Comment: ```suggestion ``` ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. Review Comment: Suggested edits to explain what batched DoFns do in general first, then in the next paragraph we'll say that currently, they are only available in Python (and why). 
########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. + +Apache Beam implicitly creates and explodes batches at the input and output boundaries. However, if Batched DoFns with equivalent types are chained together, this batch creation and explosion will be skipped making the compose batch transform more efficient. + +```python +(p | beam.Create([1, 2, 3, 4]).with_output_types(np.int64) + | beam.ParDo(MultiplyByTen()) # Implicit buffering and batch creation + | beam.ParDo(MultiplyByTen()) # Batches passed through + | beam.ParDo(MultiplyByTen())) +``` + +Apache Beam assumes that the `process_batch` method will produce batched outputs and the `process` method will produce individual elements. 
This can be overridden with the `@beam.DoFn.yields_elements` and` @beam.DoFn.yields_batches` decorators. + +Please see [here](https://beam.apache.org/documentation/programming-guide/#batched-dofn-types) for a list of supported element and batch typehints. + +Batches must have a single set of timing properties(event time, window) for every logical element in the batch and it's not possible to create batches that span multiple timestamps. Review Comment: I moved it to the earlier section with explanation of what's batched DoFn ########## learning/prompts/documentation-lookup/31_advanced_batched_dofn.md: ########## @@ -0,0 +1,35 @@ +Prompt: +What is a Batched DoFn in Apache Beam? +Response: +In Apache Beam, a [Batched DoFn](https://beam.apache.org/documentation/programming-guide/#batched-dofns) is a [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that can operate on batches of logical elements. These `DoFns` leverage popular Python vectorized libraries like [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [pandas](https://pandas.pydata.org/) to efficiently process batches of elements in a `PCollection` rather than processing each element individually. + +Batched `DoFns` are only available in the Apache Beam Python SDK. + +You can provide element-wise, batched or both implementation for a `DoFn` + +```python +class MultiplyByTen(beam.DoFn): + def process(self, element: np.int64) -> Iterator[np.int64]: + yield element * 10 + + def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]: + yield batch * 10 +``` +Apache Beam runner will select the best implementation during pipeline execution. + +Then, when `MultiplyByTen` is applied to a `PCollection`, Beam recognizes that `np.ndarray` is an acceptable batch type to use in conjunction with `np.int64` elements. Review Comment: Moved this part to later, when the example is provided. -- This is an automated message from the Apache Git Service. 
