[
https://issues.apache.org/jira/browse/BEAM-11758?focusedWorklogId=668700&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-668700
]
ASF GitHub Bot logged work on BEAM-11758:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 21/Oct/21 23:31
Start Date: 21/Oct/21 23:31
Worklog Time Spent: 10m
Work Description: ibzib commented on a change in pull request #15763:
URL: https://github.com/apache/beam/pull/15763#discussion_r734111716
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -137,45 +147,158 @@ windowing strategy, and
[GroupByKey](#implementing-the-groupbykey-and-window-primitive) primitive,
which has behavior governed by the windowing strategy.
-### User-Defined Functions (UDFs)
-
-Beam has seven varieties of user-defined function (UDF). A Beam pipeline
-may contain UDFs written in a language other than your runner, or even multiple
-languages in the same pipeline (see the [Runner API](#the-runner-api)) so the
-definitions are language-independent (see the [Fn API](#the-fn-api)).
-
-The UDFs of Beam are:
-
- * _DoFn_ - per-element processing function (used in ParDo)
- * _WindowFn_ - places elements in windows and merges windows (used in Window
- and GroupByKey)
+### Aggregation
+
+Aggregation is computing a value from multiple (1 or more) input elements. In
+Beam, the primary computational pattern for aggregation is to group all
elements
+with a common key and window then combine each group of elements using an
+associative and commutative operation. This is similar to the "Reduce"
operation
+in the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) model, though it is
+enhanced to work with unbounded input streams as well as bounded data sets.
+
+<img src="/images/aggregation.png" alt="Aggregation of elements."
width="120px">
+
+*Figure 1: Aggregation of elements. Elements with the same color represent
those
+with a common key and window.*
+
+Some simple aggregation transforms include `Count` (computes the count of all
+elements in the aggregation), `Max` (computes the maximum element in the
+aggregation), and `Sum` (computes the sum of all elements in the aggregation).
+
+When elements are grouped and emitted as a bag, the aggregation is known as
+`GroupByKey` (the associative/commutative operation is bag union). In this
case,
+the output is no smaller than the input. Often, you will apply an operation
such
+as summation, called a `CombineFn`, in which the output is significantly
smaller
+than the input. In this case the aggregation is called `CombinePerKey`.
+
+The associativity and commutativity of a `CombineFn` allows runners to
+automatically apply some optimizations:
+
+ * _Combiner lifting_: This is the most significant optimization. Input
elements
+ are combined per key and window before they are shuffled, so the volume of
+ data shuffled might be reduced by many orders of magnitude. Another term for
+ this optimization is "mapper-side combine."
+ * _Incremental combining_: When you have a `CombineFn` that reduces the data
+ size by a lot, it is useful to combine elements as they emerge from a
+ streaming shuffle. This spreads out the cost of doing combines over the time
+ that your streaming computation might be idle. Incremental combining also
+ reduces the storage of intermediate accumulators.
+
+In a real application, you might have millions of keys and/or windows; that is
+why this is still an "embarassingly parallel" computational pattern. In those
+cases where you have fewer keys, you can add parallelism by adding a
+supplementary key, splitting each of your problem's natural keys into many
+sub-keys. After these sub-keys are aggregated, the results can be further
+combined into a result for the original natural key for your problem. The
+associativity of your aggregation function ensures that this yields the same
+answer, but with more parallelism.
+
+When your input is unbounded, the computational pattern of grouping elements by
+key and window is roughly the same, but governing when and how to emit the
+results of aggregation involves three concepts:
+
+ * Windowing, which partitions your input into bounded subsets that can be
+ complete.
+ * Watermarks, which estimate the completeness of your input.
+ * Triggers, which govern when and how to emit aggregated results.
+
+For more information about available aggregation transforms, see the following
+pages:
+
+ * [Beam Programming Guide: Core Beam
transforms](/documentation/programming-guide/#core-beam-transforms)
+ * Beam Transform catalog
+ ([Java](/documentation/transforms/java/overview/#aggregation),
+ [Python](/documentation/transforms/python/overview/#aggregation))
+
+### User-defined function (UDF)
+
+Some Beam operations allow you to run user-defined code as a way of configuring
+the transform. For example, when using `ParDo`, user-defined code specifies
what
+operation to apply to every element. For `Combine`, it specifies how values
+should be combined. A Beam pipeline can contain UDFs written in a language
other
+than your runner, or even multiple languages in the same pipeline so the
+definitions are language-independent.
+
+Beam has seven varieties of UDFs:
+
+ * _DoFn_ - per-element processing function (used in `ParDo`)
+ * _WindowFn_ - places elements in windows and merges windows (used in `Window`
+ and `GroupByKey`)
* _Source_ - emits data read from external sources, including initial and
- dynamic splitting for parallelism (used in Read)
- * _ViewFn_ - adapts a materialized PCollection to a particular interface (used
- in side inputs)
+ dynamic splitting for parallelism (used in `Read`)
Review comment:
The entire bullet item.
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -137,45 +147,158 @@ windowing strategy, and
[GroupByKey](#implementing-the-groupbykey-and-window-primitive) primitive,
which has behavior governed by the windowing strategy.
-### User-Defined Functions (UDFs)
-
-Beam has seven varieties of user-defined function (UDF). A Beam pipeline
-may contain UDFs written in a language other than your runner, or even multiple
-languages in the same pipeline (see the [Runner API](#the-runner-api)) so the
-definitions are language-independent (see the [Fn API](#the-fn-api)).
-
-The UDFs of Beam are:
-
- * _DoFn_ - per-element processing function (used in ParDo)
- * _WindowFn_ - places elements in windows and merges windows (used in Window
- and GroupByKey)
+### Aggregation
+
+Aggregation is computing a value from multiple (1 or more) input elements. In
+Beam, the primary computational pattern for aggregation is to group all
elements
+with a common key and window then combine each group of elements using an
+associative and commutative operation. This is similar to the "Reduce"
operation
+in the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) model, though it is
+enhanced to work with unbounded input streams as well as bounded data sets.
+
+<img src="/images/aggregation.png" alt="Aggregation of elements."
width="120px">
+
+*Figure 1: Aggregation of elements. Elements with the same color represent
those
+with a common key and window.*
+
+Some simple aggregation transforms include `Count` (computes the count of all
+elements in the aggregation), `Max` (computes the maximum element in the
+aggregation), and `Sum` (computes the sum of all elements in the aggregation).
+
+When elements are grouped and emitted as a bag, the aggregation is known as
+`GroupByKey` (the associative/commutative operation is bag union). In this
case,
+the output is no smaller than the input. Often, you will apply an operation
such
+as summation, called a `CombineFn`, in which the output is significantly
smaller
+than the input. In this case the aggregation is called `CombinePerKey`.
+
+The associativity and commutativity of a `CombineFn` allows runners to
+automatically apply some optimizations:
+
+ * _Combiner lifting_: This is the most significant optimization. Input
elements
Review comment:
Sounds good. I agree that the programming guide is a better place to
explain optimizations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 668700)
Time Spent: 1.5h (was: 1h 20m)
> Create concepts guide in the Beam documentation
> -----------------------------------------------
>
> Key: BEAM-11758
> URL: https://issues.apache.org/jira/browse/BEAM-11758
> Project: Beam
> Issue Type: New Feature
> Components: website
> Reporter: David Huntsperger
> Assignee: Melissa Pashniak
> Priority: P3
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Create a conceptual guide to help new users understand Beam concepts.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)