This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push:
new efdbfba Publishing website 2021/12/03 18:03:23 at commit 862ece1
efdbfba is described below
commit efdbfbab9e1221d92319bbc4728d12382197ac73
Author: jenkins <[email protected]>
AuthorDate: Fri Dec 3 18:03:24 2021 +0000
Publishing website 2021/12/03 18:03:23 at commit 862ece1
---
website/generated-content/documentation/index.xml | 39 +++++++++++++-
.../index.html | 2 +-
.../runners/capability-matrix/index.html | 2 +-
.../runners/capability-matrix/index.xml | 62 ++++++++++++++++++++++
.../documentation/runtime/model/index.html | 29 ++++++++--
website/generated-content/sitemap.xml | 2 +-
6 files changed, 128 insertions(+), 8 deletions(-)
diff --git a/website/generated-content/documentation/index.xml
b/website/generated-content/documentation/index.xml
index 7289a86..fc279a0 100644
--- a/website/generated-content/documentation/index.xml
+++ b/website/generated-content/documentation/index.xml
@@ -15395,7 +15395,9 @@ serializing the elements and broadcasting them to all
the workers executing
the <code>ParDo</code>.</li>
<li>Passing elements between transforms that are running on the same worker.
This may allow the runner to avoid serializing elements; instead, the runner
-can just pass the elements in memory.</li>
+can just pass the elements in memory. This is done as part of an
+optimization that is known as
+<a
href="https://beam.apache.org/documentation/glossary/#fusion">fusion</a>.</li>
</ul>
<p>Some situations where the runner may serialize and persist elements
are:</p>
<ol>
@@ -15420,6 +15422,41 @@ choose an appropriate middle-ground between persisting
results after every
element, and having to retry everything if there is a failure. For example, a
streaming runner may prefer to process and commit small bundles, and a batch
runner may prefer to process larger bundles.</p>
+<h3 id="data-partitioning-and-inter-stage-execution">Data partitioning and
inter-stage execution</h3>
+<p>Partitioning and parallelization of element processing within a Beam
pipeline is
+dependent on two things:</p>
+<ul>
+<li>Data source implementation</li>
+<li>Inter-stage key parallelism</li>
+</ul>
+<p>Beam pipelines read data from a source (e.g. <code>KafkaIO</code>,
<code>BigQueryIO</code>, <code>JdbcIO</code>,
+or your own source implementation). To implement a Source in Beam one must
+implement it as a Splittable <code>DoFn</code>. A Splittable
<code>DoFn</code> provides the runner
+with interfaces to facilitate the splitting of work.</p>
+<p>When running key-based operations in Beam (e.g.
<code>GroupByKey</code>, <code>Combine</code>,
+<code>Reshuffle.perKey</code>, and stateful <code>DoFn</code>s),
Beam runners perform serialization
+and transfer of data known as <em>shuffle</em><sup>1</sup>.
Shuffle allows data
+elements of the same key to be processed together.</p>
+<p>The way in which runners <em>shuffle</em> data may be slightly
different for Batch and
+Streaming execution modes.</p>
+<p><sup>1</sup>Not to be confused with the <code>shuffle</code>
operation in some runners.</p>
+<h4 id="data-ordering-in-a-pipeline-execution">Data ordering in a pipeline
execution</h4>
+<p>The Beam model does not define strict guidelines regarding the order in
which
+runners process elements or transport them across
<code>PTransforms</code>. Runners are
+free to implement data transfer semantics in different forms.</p>
+<p>Some use cases exist where user pipelines may need to rely on specific
ordering
+semantics in pipeline execution. The <a
href="/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html">capability
matrix documents</a>
+runner behavior for <strong>key-ordered delivery</strong>.</p>
+<p>Consider a single Beam worker processing a series of bundles from the
same Beam
+transform, and consider a <code>PTransform</code> that outputs data from
this Stage into a
+downstream <code>PCollection</code>. Finally, consider two events
<em>with the same key</em>
+emitted in a certain order by this worker (within the same bundle or as part of
+different bundles).</p>
+<p>We say that the Beam runner supports <strong>key-ordered
delivery</strong> if it guarantees
+that these two events will be observed in the same order by a PTransform that
is
+immediately downstream independently of the kind of data transmission
method.</p>
+<p>This characteristic will hold true in runners and operations that have
+key-limited parallelism.</p>
<h2 id="parallelism">Failures and parallelism within and between
transforms</h2>
<p>In this section, we discuss how elements in the input collection are
processed
in parallel, and how transforms are retried when failures occur.</p>
diff --git
a/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
b/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
index f3ef192..4082d17 100644
---
a/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
+++
b/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
@@ -18,7 +18,7 @@
function addPlaceholder(){$('input:text').attr('placeholder',"What are you
looking for?");}
function endSearch(){var
search=document.querySelector(".searchBar");search.classList.add("disappear");var
icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><a class=back-button
href=/documentation/runners/capability-matrix><i class="fas
fa-arrow-left"></i>back to collapsed details</a><h4>Additional common features
not yet part of the Beam model</h4><div class=table-container><div
class="table-left
big-left"><table><tr><th></th></tr><tr><th>Drain</th></tr><tr><th>Checkpoint</th></tr></table></div><div
class="table-right big-right"><div i [...]
+function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><a class=back-button
href=/documentation/runners/capability-matrix><i class="fas
fa-arrow-left"></i>back to collapsed details</a><h4>Additional common features
not yet part of the Beam model</h4><div class=table-container><div
class="table-left
big-left"><table><tr><th></th></tr><tr><th>Drain</th></tr><tr><th>Checkpoint</th></tr><tr><th>Key-ordered
delivery</th></tr></table></div><di [...]
<a href=http://www.apache.org>The Apache Software Foundation</a>
| <a href=/privacy_policy>Privacy Policy</a>
| <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam
logo, and the Apache feather logo are either registered trademarks or
trademarks of The Apache Software Foundation. All other products or name brands
are trademarks of their respective holders, including The Apache Software
Foundation.</div></div></div></div></footer></body></html>
\ No newline at end of file
diff --git
a/website/generated-content/documentation/runners/capability-matrix/index.html
b/website/generated-content/documentation/runners/capability-matrix/index.html
index 057fe9a..245259b 100644
---
a/website/generated-content/documentation/runners/capability-matrix/index.html
+++
b/website/generated-content/documentation/runners/capability-matrix/index.html
@@ -18,7 +18,7 @@
function addPlaceholder(){$('input:text').attr('placeholder',"What are you
looking for?");}
function endSearch(){var
search=document.querySelector(".searchBar");search.classList.add("disappear");var
icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><div class="section-nav closed"
data-offset-top=90 data-offset-bottom=500><span class="section-nav-back
glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list
data-section-nav><li><span
class=section-nav-list-main-title>Runners</span></li><li><a
href=/documentation/runners/capability-matrix/>Capability Matrix</a></li><li><a
href=/documentation/runners/direct/>Direct Ru [...]
+function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><div class="section-nav closed"
data-offset-top=90 data-offset-bottom=500><span class="section-nav-back
glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list
data-section-nav><li><span
class=section-nav-list-main-title>Runners</span></li><li><a
href=/documentation/runners/capability-matrix/>Capability Matrix</a></li><li><a
href=/documentation/runners/direct/>Direct Ru [...]
<script>$('.table-headers').scroll(function(e){$('#'+this.id+'.table-center').scrollLeft($(this).scrollLeft());});$('.table-center').scroll(function(e){$('#'+this.id+'.table-headers').scrollLeft($(this).scrollLeft());});</script><div
class=feedback><p class=update>Last updated on 2021/02/05</p><h3>Have you
found everything you were looking for?</h3><p class=description>Was it all
useful and clear? Is there anything that you would like to change? Let us
know!</p><button class=load-button> [...]
<a href=http://www.apache.org>The Apache Software Foundation</a>
| <a href=/privacy_policy>Privacy Policy</a>
diff --git
a/website/generated-content/documentation/runners/capability-matrix/index.xml
b/website/generated-content/documentation/runners/capability-matrix/index.xml
index 8ac8d1b..88ede27 100644
---
a/website/generated-content/documentation/runners/capability-matrix/index.xml
+++
b/website/generated-content/documentation/runners/capability-matrix/index.xml
@@ -28,6 +28,9 @@ back to collapsed details
<tr>
<th>Checkpoint</th>
</tr>
+<tr>
+<th>Key-ordered delivery</th>
+</tr>
</table>
</div>
<div class="table-right big-right">
@@ -155,6 +158,65 @@ Samza has a native checkpoint capability.
<br>
</td>
</tr>
+<tr>
+<td style='background-color:#f9f9f9;border-color:#d8d8d8'>
+<b>
+<p>Partially : </p>
+</b>
+<br>
+Dataflow performs different shuffling algorithms for batch and streaming.
Dataflow guarantees key-ordered delivery in streaming, though not in batch.
+</td>
+<td style='background-color:#f9f9f9;border-color:#d8d8d8'>
+<b>
+<p>Partially : </p>
+</b>
+<br>
+Flink may perform different shuffling algorithms for batch and streaming.
Flink guarantees key-ordered delivery in streaming, though not in batch.
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#f9f9f9;border-color:#d8d8d8'>
+<b>
+<p>Partially : </p>
+</b>
+<br>
+Samza may perform different shuffling algorithms for batch and streaming.
Samza guarantees key-ordered delivery in streaming, though not in batch.
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+</tr>
</table>
</div>
</div>
diff --git a/website/generated-content/documentation/runtime/model/index.html
b/website/generated-content/documentation/runtime/model/index.html
index f612cc3..2a483c4 100644
--- a/website/generated-content/documentation/runtime/model/index.html
+++ b/website/generated-content/documentation/runtime/model/index.html
@@ -18,7 +18,7 @@
function addPlaceholder(){$('input:text').attr('placeholder',"What are you
looking for?");}
function endSearch(){var
search=document.querySelector(".searchBar");search.classList.add("disappear");var
icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><div class="section-nav closed"
data-offset-top=90 data-offset-bottom=500><span class="section-nav-back
glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list
data-section-nav><li><span
class=section-nav-list-main-title>Documentation</span></li><li><a
href=/documentation>Using the Documentation</a></li><li
class=section-nav-item--collapsible><span class=section-nav-lis [...]
+function openMenu(){addPlaceholder();blockScroll();}</script><div
class="clearfix container-main-content"><div class="section-nav closed"
data-offset-top=90 data-offset-bottom=500><span class="section-nav-back
glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list
data-section-nav><li><span
class=section-nav-list-main-title>Documentation</span></li><li><a
href=/documentation>Using the Documentation</a></li><li
class=section-nav-item--collapsible><span class=section-nav-lis [...]
may observe various effects as a result of the runner’s choices. This page
describes these effects so you can better understand how Beam pipelines
execute.</p><h2 id=processing-of-elements>Processing of elements</h2><p>The
serialization and communication of elements between machines is one of the
most expensive operations in a distributed execution of your pipeline. Avoiding
@@ -32,7 +32,9 @@ involve serializing elements and communicating them to other
workers.</li><li>Us
serializing the elements and broadcasting them to all the workers executing
the <code>ParDo</code>.</li><li>Passing elements between transforms that are
running on the same worker.
This may allow the runner to avoid serializing elements; instead, the runner
-can just pass the elements in memory.</li></ul><p>Some situations where the
runner may serialize and persist elements are:</p><ol><li>When used as part of
a stateful <code>DoFn</code>, the runner may persist values to some
+can just pass the elements in memory. This is done as part of an
+optimization that is known as
+<a
href=https://beam.apache.org/documentation/glossary/#fusion>fusion</a>.</li></ul><p>Some
situations where the runner may serialize and persist elements
are:</p><ol><li>When used as part of a stateful <code>DoFn</code>, the runner
may persist values to some
state mechanism.</li><li>When committing the results of processing, the runner
may persist the outputs
as a checkpoint.</li></ol><h3 id=bundling-and-persistence>Bundling and
persistence</h3><p>Beam pipelines often focus on “<a
href=https://en.wikipedia.org/wiki/embarrassingly_parallel>embarassingly
parallel</a>”
problems. Because of this, the APIs emphasize processing elements in parallel,
@@ -46,7 +48,26 @@ bundles is arbitrary and selected by the runner. This allows
the runner to
choose an appropriate middle-ground between persisting results after every
element, and having to retry everything if there is a failure. For example, a
streaming runner may prefer to process and commit small bundles, and a batch
-runner may prefer to process larger bundles.</p><h2 id=parallelism>Failures
and parallelism within and between transforms</h2><p>In this section, we
discuss how elements in the input collection are processed
+runner may prefer to process larger bundles.</p><h3
id=data-partitioning-and-inter-stage-execution>Data partitioning and
inter-stage execution</h3><p>Partitioning and parallelization of element
processing within a Beam pipeline is
+dependent on two things:</p><ul><li>Data source
implementation</li><li>Inter-stage key parallelism</li></ul><p>Beam pipelines
read data from a source (e.g. <code>KafkaIO</code>, <code>BigQueryIO</code>,
<code>JdbcIO</code>,
+or your own source implementation). To implement a Source in Beam one must
+implement it as a Splittable <code>DoFn</code>. A Splittable <code>DoFn</code>
provides the runner
+with interfaces to facilitate the splitting of work.</p><p>When running
key-based operations in Beam (e.g. <code>GroupByKey</code>,
<code>Combine</code>,
+<code>Reshuffle.perKey</code>, and stateful <code>DoFn</code>s), Beam runners
perform serialization
+and transfer of data known as <em>shuffle</em><sup>1</sup>. Shuffle allows data
+elements of the same key to be processed together.</p><p>The way in which
runners <em>shuffle</em> data may be slightly different for Batch and
+Streaming execution modes.</p><p><sup>1</sup>Not to be confused with the
<code>shuffle</code> operation in some runners.</p><h4
id=data-ordering-in-a-pipeline-execution>Data ordering in a pipeline
execution</h4><p>The Beam model does not define strict guidelines regarding the
order in which
+runners process elements or transport them across <code>PTransforms</code>.
Runners are
+free to implement data transfer semantics in different forms.</p><p>Some use
cases exist where user pipelines may need to rely on specific ordering
+semantics in pipeline execution. The <a
href=/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html>capability
matrix documents</a>
+runner behavior for <strong>key-ordered delivery</strong>.</p><p>Consider a
single Beam worker processing a series of bundles from the same Beam
+transform, and consider a <code>PTransform</code> that outputs data from this
Stage into a
+downstream <code>PCollection</code>. Finally, consider two events <em>with the
same key</em>
+emitted in a certain order by this worker (within the same bundle or as part of
+different bundles).</p><p>We say that the Beam runner supports
<strong>key-ordered delivery</strong> if it guarantees
+that these two events will be observed in the same order by a PTransform that
is
+immediately downstream independently of the kind of data transmission
method.</p><p>This characteristic will hold true in runners and operations that
have
+key-limited parallelism.</p><h2 id=parallelism>Failures and parallelism within
and between transforms</h2><p>In this section, we discuss how elements in the
input collection are processed
in parallel, and how transforms are retried when failures occur.</p><h3
id=data-parallelism>Data-parallelism within one transform</h3><p>When executing
a single <code>ParDo</code>, a runner might divide an example input
collection of nine elements into two bundles as shown in figure 1.</p><p><img
src=/images/execution_model_bundling.svg alt="Bundle A contains five elements.
Bundle B contains four elements."></p><p><em>Figure 1: A runner divides an
input collection into two bundles.</em></p><p>When the <code>ParDo</code>
executes, workers may process the two bundles in parallel as
shown in figure 2.</p><p><img src=/images/execution_model_bundling_gantt.svg
alt="Two workers process the two bundles in parallel. Worker one processes
bundle A. Worker two processes bundle B."></p><p><em>Figure 2: Two workers
process the two bundles in parallel.</em></p><p>Since elements cannot be split,
the maximum parallelism for a transform depends
@@ -87,7 +108,7 @@ elements in the input bundle must be retried. These two
<code>ParDo</code>s are
the input bundle are retried.</em></p><p>Note that the retry does not
necessarily have the same processing time as the
original attempt, as shown in the diagram.</p><p>All <code>DoFns</code> that
experience coupled failures are terminated and must be torn
down since they aren’t following the normal <code>DoFn</code> lifecycle
.</p><p>Executing transforms this way allows a runner to avoid persisting
elements
-between transforms, saving on persistence costs.</p><div class=feedback><p
class=update>Last updated on 2020/08/28</p><h3>Have you found everything you
were looking for?</h3><p class=description>Was it all useful and clear? Is
there anything that you would like to change? Let us know!</p><button
class=load-button><a href="mailto:[email protected]?subject=Beam Website
Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div
class=footer__contained><div class=foote [...]
+between transforms, saving on persistence costs.</p><div class=feedback><p
class=update>Last updated on 2021/12/03</p><h3>Have you found everything you
were looking for?</h3><p class=description>Was it all useful and clear? Is
there anything that you would like to change? Let us know!</p><button
class=load-button><a href="mailto:[email protected]?subject=Beam Website
Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div
class=footer__contained><div class=foote [...]
<a href=http://www.apache.org>The Apache Software Foundation</a>
| <a href=/privacy_policy>Privacy Policy</a>
| <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam
logo, and the Apache feather logo are either registered trademarks or
trademarks of The Apache Software Foundation. All other products or name brands
are trademarks of their respective holders, including The Apache Software
Foundation.</div></div></div></div></footer></body></html>
\ No newline at end of file
diff --git a/website/generated-content/sitemap.xml
b/website/generated-content/sitemap.xml
index 424aa32..313115d 100644
--- a/website/generated-content/sitemap.xml
+++ b/website/generated-content/sitemap.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-12-01T21:32:04+03:00</lastmod></url><url><loc>/blog/g
[...]
\ No newline at end of file
+<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-12-01T21:32:04+03:00</lastmod></url><url><loc>/blog/g
[...]
\ No newline at end of file