[GitHub] [beam] pcoet commented on a diff in pull request #23909: Initial draft of Batched DoFn user guide

GitBox Wed, 02 Nov 2022 09:56:35 -0700


pcoet commented on code in PR #23909:
URL: https://github.com/apache/beam/pull/23909#discussion_r1011982995



##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -7799,3 +7799,246 @@ Dataflow supports multi-language pipelines through the Dataflow Runner v2 backen
 ### 13.4 Tips and Troubleshooting {#x-lang-transform-tips-troubleshooting}
 
 For additional tips and troubleshooting information, see [here](https://cwiki.apache.org/confluence/display/BEAM/Multi-language+Pipelines+Tips).
+
+## 14 Batched DoFns {#batched-dofns}
+{{< language-switcher java py go typescript >}}
+
+{{< paragraph class="language-go language-java language-typescript" >}}
+Batched DoFns are currently a Python-only feature.
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+Batched DoFns enable users to create modular, composable components that
+operate on batches of multiple logical elements. These DoFns can leverage
+vectorized Python libraries, like numpy, scipy, and pandas, which operate on
+batches of data for efficiency.
+{{< /paragraph >}}
+
+### 14.1 Basics {#batched-dofn-basics}
+{{< paragraph class="language-go language-java language-typescript" >}}
+Batched DoFns are currently a Python-only feature.
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+A trivial Batched DoFn might look like this:
+{{< /paragraph >}}
+
+{{< highlight py >}}
+class MultiplyByTwo(beam.DoFn):
+  # Type hints on process_batch declare the batch type (here, np.ndarray)
+  def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]:
+    yield batch * 2
+
+  # Declare what the element-wise output type is
+  def infer_output_type(self, input_element_type):
+    return np.int64
+{{< /highlight >}}
+
+{{< paragraph class="language-py" >}}
+This DoFn can be used in a Beam pipeline that otherwise operates on individual
+elements. Beam will implicitly buffer elements and create numpy arrays on the
+input side, and on the output side it will explode the numpy arrays back into
+individual elements:
+{{< /paragraph >}}
+
+{{< highlight py >}}
+(p | beam.Create([1,2,3,4]).with_output_types(np.int64)
+   | beam.ParDo(MultiplyByTwo())  # Implicit buffering and batch creation
+   | beam.Map(lambda x: x/3))  # Implicit batch explosion
+{{< /highlight >}}
+
+{{< paragraph class="language-py" >}}
+However, if Batched DoFns with equivalent types are chained together, this
+batching and unbatching will be elided. The batches will be passed straight
+through! This makes it much simpler to compose transforms that operate on
+batches.
+{{< /paragraph >}}
+
+{{< highlight py >}}
+(p | beam.Create([1,2,3,4]).with_output_types(np.int64)
+   | beam.ParDo(MultiplyByTwo())  # Implicit buffering and batch creation
+   | beam.ParDo(MultiplyByTwo())  # Batches passed through
+   | beam.ParDo(MultiplyByTwo()))
+{{< /highlight >}}
+
+{{< paragraph class="language-py" >}}
+Note that the type hints on the Batched DoFn are *critical*. This is how

Review Comment:
   Would it be useful to add a sentence calling out the type hint in the code 
snippet: `.with_output_types(np.int64)`? I thought that Python type hints were 
something else (https://docs.python.org/3/library/typing.html), and I had to 
read more to realize that the type hint is provided by this method and arg.
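
   A minimal sketch of the distinction, in plain Python with no Beam or numpy dependency: the annotations on `process_batch` are ordinary `typing`-style hints that can be read back with the standard library, whereas `.with_output_types(np.int64)` is a separate method call on the transform. The class below is a hypothetical stand-in (it does not subclass `beam.DoFn`), and `List[int]` stands in for `np.ndarray`:

```python
from typing import Iterator, List, get_type_hints


class MultiplyByTwo:
    """Hypothetical stand-in for the DoFn in the diff; not a real beam.DoFn."""

    # List[int] stands in for np.ndarray so the sketch needs only the stdlib.
    def process_batch(self, batch: List[int]) -> Iterator[List[int]]:
        yield [x * 2 for x in batch]


# The annotations are ordinary Python type hints, readable via the typing module.
hints = get_type_hints(MultiplyByTwo.process_batch)
print(hints["batch"])
print(hints["return"])
```

   A sentence in the guide could then state which of the two kinds of hint it is referring to at each point.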



