[GitHub] [beam] lukecwik commented on a change in pull request #15781: [BEAM-11758] Update basics page: Splittable DoFn

GitBox Tue, 26 Oct 2021 10:51:36 -0700


lukecwik commented on a change in pull request #15781:
URL: https://github.com/apache/beam/pull/15781#discussion_r736781842




##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -360,3 +364,42 @@ For more information about runners, see the following 
pages:
 
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
+
+### Splittable DoFn
+
+Splittable `DoFn` (SDF) is a generalization of `DoFn` that lets you process
+elements in a non-monolithic way. Splittable `DoFn` makes it easier to create
+complex, modular I/O connectors in Beam.
+
+A regular `ParDo` processes an entire element at a time, applying your regular
+`DoFn` and waiting for the call to terminate. When you instead apply a
+splittable `DoFn` to each element, the runner has the option of splitting the
+element's processing into smaller tasks. You can checkpoint the processing of 
an
+element, and you can split the remaining work to yield additional parallelism.
+
+For example, imagine you want to read every line from very large text files.
+When you write your splittable `DoFn`, you can have separate pieces of logic to
+read a segment of a file, split a segment of a file into sub-segments, and
+report progress through the current segment. The runner can then invoke your
+splittable `DoFn` intelligently to split up each input and read portions
+separately, in parallel.
+
+A common computation pattern has the following steps:
+
+ 1. The runner splits an incoming element before starting any processing.
+ 2. The runner starts running your processing logic on each sub-element.
+ 3. If the runner notices that some sub-elements are taking longer than others,
+    the runner halts processing of those sub-elements and splits again.
+ 4. Repeat from step 2.

Review comment:
       The runner doesn't halt processing, the current restriction being 
processed is sub-divided while the restriction is being processed. For example 
the current sub-element is the range `[100, 200)` and the current processing is 
at `140` then the range is split into two `[100, Y)` and `[Y, 200)` where `Y > 
140`.
   
   Also, I think it would make sense that part of the common pattern is that 
users also get to decide to checkpoint at any time as you described above.
   
   How about something like:
   ```suggestion
    1. The runner splits an incoming element before starting any processing.
    2. The runner starts running your processing logic on each sub-element.
    3. If the runner notices that some sub-elements are taking longer than 
others,
       the runner splits those sub-elements further and repeats step 2.
    4. The sub-element finishes processing or the user chooses to checkpoint 
the sub-element and repeats step 2.
   ```

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -42,6 +42,10 @@ understand an important set of core concepts:
    them to a runner.
  * [_Runner_](#runner) - A runner runs a Beam pipeline using the capabilities 
of
    your chosen data processing engine.
+ * [_Splittable DoFn_](#splittable-dofn) - Splittable DoFns let you process
+   elements in a non-monolithic way. You can checkpoint the processing of
+   an element, and you can split the remaining work to yield additional

Review comment:
       ```suggestion
      an element, and the runner can split the remaining work to yield 
additional
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] lukecwik commented on a change in pull request #15781: [BEAM-11758] Update basics page: Splittable DoFn

Reply via email to