kennknowles commented on a change in pull request #15378:
URL: https://github.com/apache/beam/pull/15378#discussion_r718806316
##########
File path: website/www/site/content/en/documentation/runtime/model.md
##########
@@ -77,6 +82,48 @@ element, and having to retry everything if there is a failure. For example, a
streaming runner may prefer to process and commit small bundles, and a batch
runner may prefer to process larger bundles.
+### Data partitioning and inter-stage execution
+
+Partitioning and parallelization of element processing within a Beam pipeline
+depend on two things:
+
+- The data source implementation
+- Inter-stage key parallelism
+
+Beam pipelines read data from a source (e.g. `KafkaIO`, `BigQueryIO`, `JdbcIO`,
+or your own source implementation). In Beam, a source must be implemented as a
+splittable `DoFn`. A splittable `DoFn` provides the runner with interfaces
+that facilitate splitting its work into smaller pieces that can be processed
+in parallel.
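The splitting described above can be illustrated conceptually. The sketch below is not the actual Beam splittable `DoFn` API (which involves restriction providers and trackers); it only shows the core idea of dividing an offset-based range of work into sub-ranges that different workers could claim, with a hypothetical helper name:

```python
# Conceptual sketch (not the real Beam API): a runner splits an
# offset-based range of work so sub-ranges can be processed in parallel
# by different workers.

def split_restriction(start, stop, desired_chunk_size):
    """Split the half-open range [start, stop) into contiguous
    sub-ranges of at most desired_chunk_size elements each."""
    splits = []
    while start < stop:
        end = min(start + desired_chunk_size, stop)
        splits.append((start, end))
        start = end
    return splits

# A 10-element range split into chunks of up to 4 elements:
print(split_restriction(0, 10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each sub-range is independent, which is what lets a runner distribute source reads across workers.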
+
+When running key-based operations in Beam (e.g. `GroupByKey`, `Combine`,
+`Reshuffle.perKey`, and stateful `DoFn`s), Beam runners perform serialization
+and transfer of data known as *shuffle*<sup>1</sup>. Shuffle allows data
+elements of the same key to be processed together.
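As a mental model only (real runners shuffle serialized data across machines), shuffle can be pictured as grouping key/value pairs so that each key's values are co-located, sketched here with a hypothetical helper:

```python
from collections import defaultdict

# Conceptual model of shuffle (not a runner implementation): elements
# are routed by key so that all values for a given key end up together
# and can be processed by a single downstream worker.

def shuffle_by_key(elements):
    """Group (key, value) pairs so each key's values are co-located."""
    grouped = defaultdict(list)
    for key, value in elements:
        grouped[key].append(value)
    return dict(grouped)

events = [("user1", 3), ("user2", 5), ("user1", 7)]
print(shuffle_by_key(events))  # {'user1': [3, 7], 'user2': [5]}
```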
+
+The way in which runners *shuffle* data may differ between batch and
+streaming execution modes.
+
+<sup>1</sup>Not to be confused with the `shuffle` operation in some runners.
+
+#### Data ordering in a pipeline execution
+
+The Beam model does not define strict guarantees regarding the order in which
+runners process elements or transport them across `PTransform`s. Runners are
+free to implement shuffling semantics in different ways.
+
+Some use cases exist where user pipelines may need to rely on specific ordering
+semantics in pipeline execution. The [capability
+matrix](/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html)
+documents runner behavior for **key-ordered delivery**.
+
+Consider a single Beam worker processing a series of bundles from the same Beam
+stage, and consider a `PTransform` that outputs data from this stage into a
+downstream `PCollection`. Finally, consider two events *with the same key*
+emitted in a certain order by this worker (within the same bundle or as part of
+different bundles).
+
+We say that a Beam runner supports **key-ordered delivery** if it guarantees
+that these two events will be observed downstream in the same order in which
+they were emitted, regardless of how they were transmitted.
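The property can be made concrete with a small check. The sketch below (a hypothetical helper, not part of Beam) verifies that, for every key, a downstream observer saw events in the same order the upstream worker emitted them; note that interleaving *across* keys is still allowed:

```python
# Illustrative check of key-ordered delivery (hypothetical helper, not
# part of Beam): per key, the observed event sequence must match the
# emitted sequence, while interleaving across keys may differ.

def is_key_ordered(emitted, observed):
    """Both arguments are lists of (key, event) pairs."""
    def per_key(pairs):
        order = {}
        for key, event in pairs:
            order.setdefault(key, []).append(event)
        return order
    return per_key(emitted) == per_key(observed)

emitted = [("k1", "a"), ("k2", "x"), ("k1", "b")]
# Cross-key interleaving changed, but each key's order is preserved:
print(is_key_ordered(emitted, [("k2", "x"), ("k1", "a"), ("k1", "b")]))  # True
# "b" observed before "a" for key k1 violates key-ordered delivery:
print(is_key_ordered(emitted, [("k1", "b"), ("k1", "a"), ("k2", "x")]))  # False
```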
Review comment:
I was thinking about this comment, and it occurs to me that you of course
also have to have key-limited parallelism in the producer. I don't think this
is required anywhere in the model (state/timers require key+window limited
parallelism), and it is also not mentioned here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]