aaltay commented on a change in pull request #12252: URL: https://github.com/apache/beam/pull/12252#discussion_r454689609
########## File path: website/www/site/content/en/documentation/transforms/python/aggregation/combineglobally.md ########## @@ -14,29 +14,197 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> + # CombineGlobally -<table align="left"> - <a target="_blank" class="button" - href="https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineGlobally"> - <img src="https://beam.apache.org/images/logos/sdks/python.png" width="20px" height="20px" - alt="Pydoc" /> - Pydoc - </a> -</table> -<br><br> +{{< localstorage language language-py >}} +{{< button-pydoc path="apache_beam.transforms.core" class="CombineGlobally" >}} Combines all elements in a collection. See more information in the [Beam Programming Guide](/documentation/programming-guide/#combine). ## Examples -See [BEAM-7390](https://issues.apache.org/jira/browse/BEAM-7390) for updates. -## Related transforms +In the following examples, we create a pipeline with a `PCollection` of produce. +Then, we apply `CombineGlobally` in multiple ways to combine all the elements in the `PCollection`. + +`CombineGlobally` accepts a function that takes a list of elements as an input, and combines them to return a single element. + +### Example 1: Combining with a function + +We define a function `get_common_items` which takes a list of sets as an input, and calculates the intersection (common items) of those sets. 
+ +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_function >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 2: Combining with a lambda function + +We can also use lambda functions to simplify **Example 1**. + +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_lambda >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 3: Combining with multiple arguments + +You can pass functions with multiple arguments to `CombineGlobally`. +They are passed as additional positional arguments or keyword arguments to the function. + +In this example, the lambda function takes `sets` and `exclude` as arguments. 
+ +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_multiple_arguments >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items_with_exceptions >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 4: Combining with side inputs as singletons + +If the `PCollection` has a single value, such as the average from another computation, +passing the `PCollection` as a *singleton* accesses that value. + +In this example, we pass a `PCollection` the value `'🥕'` as a singleton. +We then use that value to exclude specific items. + +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_side_inputs_singleton >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items_with_exceptions >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 5: Combining with side inputs as iterators + +If the `PCollection` has multiple values, pass the `PCollection` as an *iterator*. +This accesses elements lazily as they are needed, +so it is possible to iterate over large `PCollection`s that won't fit into memory. 
+ +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_side_inputs_iter >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items_with_exceptions >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +> **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`, +> but this requires that all the elements fit into memory. + +### Example 6: Combining with side inputs as dictionaries + +If a `PCollection` is small enough to fit into memory, then that `PCollection` can be passed as a *dictionary*. +Each element must be a `(key, value)` pair. +Note that all the elements of the `PCollection` must fit into memory for this. +If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead. + +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_side_inputs_dict >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" custom_common_items >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 7: Combining with a `CombineFn` + +The more general way to combine elements, and the most flexible, is with a class that inherits from `CombineFn`. 
+ +* [`CombineFn.create_accumulator()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.create_accumulator): + Called *once per `CombineFn` instance* when the `CombineFn` instance is initialized. + This creates an empty accumulator. + For example, an empty accumulator for a sum would be `0`, while an empty accumulator for a product (multiplication) would be `1`. + +* [`CombineFn.add_input()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.add_input): + Called *once per element*. + Takes an accumulator and an input element, combines them and returns the updated accumulator. + +* [`CombineFn.merge_accumulators()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.merge_accumulators): + Called *once per bundle of elements* after processing the last element of the bundle. + Multiple accumulators could be processed in parallel, so this function helps merging them into a single accumulator. + +* [`CombineFn.extract_output()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.extract_output):

Review comment:

"Called *once per `CombineFn` instance* when the `CombineFn` instance is done."
-> This is runner dependent. I would drop this.

"After all accumulators have been merged into a single final accumulator, `extract_output` allows to do additional calculations."
-> I would simplify this by dropping the part about merging. `extract_output` can be called without any merging (e.g. when there are no other accumulators to merge).

"This is useful for calculating averages, percentages, or anything that needs aggregate information from all the elements."
-> This describes combiners in general. Combiners need the whole interface, including `extract_output`, to accomplish this goal.
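To illustrate the point about `extract_output` being reachable without any merging, here is a minimal sketch. It does not use Beam: `PercentagesFn` is a plain-Python stand-in that only mirrors the four `CombineFn` method names, and `run_single_bundle` is a hypothetical driver for illustration, not how any real runner is implemented.

```python
class PercentagesFn:
    """Plain-Python stand-in mirroring the CombineFn method names (not a Beam class)."""

    def create_accumulator(self):
        # Empty accumulator: no items counted yet.
        return {}

    def add_input(self, accumulator, element):
        # Count one more occurrence of the element.
        accumulator[element] = accumulator.get(element, 0) + 1
        return accumulator

    def merge_accumulators(self, accumulators):
        # Sum the per-item counts of all partial accumulators.
        merged = {}
        for accumulator in accumulators:
            for item, count in accumulator.items():
                merged[item] = merged.get(item, 0) + count
        return merged

    def extract_output(self, accumulator):
        # Turn counts into fractions of the total.
        total = sum(accumulator.values())
        if total == 0:
            return {}
        return {item: count / total for item, count in accumulator.items()}


def run_single_bundle(fn, elements):
    """Hypothetical driver: a single accumulator, so merge_accumulators is never called."""
    accumulator = fn.create_accumulator()
    for element in elements:
        accumulator = fn.add_input(accumulator, element)
    return fn.extract_output(accumulator)


result = run_single_bundle(PercentagesFn(), ["carrot", "tomato", "carrot", "carrot"])
print(result)
```

With only one accumulator there is nothing to merge, yet `extract_output` still produces the final result, which is why the description should not tie it to a merge step.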
########## File path: website/www/site/content/en/documentation/transforms/python/aggregation/combineglobally.md ########## @@ -14,29 +14,197 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> + # CombineGlobally -<table align="left"> - <a target="_blank" class="button" - href="https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineGlobally"> - <img src="https://beam.apache.org/images/logos/sdks/python.png" width="20px" height="20px" - alt="Pydoc" /> - Pydoc - </a> -</table> -<br><br> +{{< localstorage language language-py >}} +{{< button-pydoc path="apache_beam.transforms.core" class="CombineGlobally" >}} Combines all elements in a collection. See more information in the [Beam Programming Guide](/documentation/programming-guide/#combine). ## Examples -See [BEAM-7390](https://issues.apache.org/jira/browse/BEAM-7390) for updates. -## Related transforms +In the following examples, we create a pipeline with a `PCollection` of produce. +Then, we apply `CombineGlobally` in multiple ways to combine all the elements in the `PCollection`. + +`CombineGlobally` accepts a function that takes a list of elements as an input, and combines them to return a single element. + +### Example 1: Combining with a function + +We define a function `get_common_items` which takes a list of sets as an input, and calculates the intersection (common items) of those sets. 
+ +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_function >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 2: Combining with a lambda function + +We can also use lambda functions to simplify **Example 1**. + +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_lambda >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 3: Combining with multiple arguments + +You can pass functions with multiple arguments to `CombineGlobally`. +They are passed as additional positional arguments or keyword arguments to the function. + +In this example, the lambda function takes `sets` and `exclude` as arguments. 
+ +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_multiple_arguments >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items_with_exceptions >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 4: Combining with side inputs as singletons + +If the `PCollection` has a single value, such as the average from another computation, +passing the `PCollection` as a *singleton* accesses that value. + +In this example, we pass a `PCollection` the value `'🥕'` as a singleton. +We then use that value to exclude specific items. + +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_side_inputs_singleton >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items_with_exceptions >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 5: Combining with side inputs as iterators + +If the `PCollection` has multiple values, pass the `PCollection` as an *iterator*. +This accesses elements lazily as they are needed, +so it is possible to iterate over large `PCollection`s that won't fit into memory. 
+ +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_side_inputs_iter >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" common_items_with_exceptions >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +> **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`, +> but this requires that all the elements fit into memory. + +### Example 6: Combining with side inputs as dictionaries + +If a `PCollection` is small enough to fit into memory, then that `PCollection` can be passed as a *dictionary*. +Each element must be a `(key, value)` pair. +Note that all the elements of the `PCollection` must fit into memory for this. +If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead. + +{{< highlight py >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" combineglobally_side_inputs_dict >}} +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection` after `CombineGlobally`: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} +{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally_test.py" custom_common_items >}} +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/aggregation/combineglobally.py" >}} + +### Example 7: Combining with a `CombineFn` + +The more general way to combine elements, and the most flexible, is with a class that inherits from `CombineFn`. 
+ +* [`CombineFn.create_accumulator()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.create_accumulator): + Called *once per `CombineFn` instance* when the `CombineFn` instance is initialized. + This creates an empty accumulator. + For example, an empty accumulator for a sum would be `0`, while an empty accumulator for a product (multiplication) would be `1`. + +* [`CombineFn.add_input()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.add_input): + Called *once per element*. + Takes an accumulator and an input element, combines them and returns the updated accumulator. + +* [`CombineFn.merge_accumulators()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombineFn.merge_accumulators): + Called *once per bundle of elements* after processing the last element of the bundle.

Review comment:

> @aaltay can you confirm if this and the rest of the descriptions are correct? Thanks!

Will do.

There is no guarantee that `merge_accumulators` will be called once per bundle. It could be called more than once depending on the runner implementation.
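A small sketch of why the number of `merge_accumulators` calls should not be documented as fixed. It is plain Python, not Beam: `MeanFn` only mirrors the `CombineFn` method names, the `merge_calls` counter exists purely for this illustration, and the two drivers are hypothetical merge strategies rather than the behavior of any specific runner.

```python
class MeanFn:
    """Plain-Python stand-in mirroring the CombineFn method names (not a Beam class)."""

    def __init__(self):
        self.merge_calls = 0  # bookkeeping for this sketch only

    def create_accumulator(self):
        return (0, 0)  # (running sum, element count)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return (total + element, count + 1)

    def merge_accumulators(self, accumulators):
        self.merge_calls += 1
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def accumulate(fn, bundle):
    """Fold one bundle of elements into a fresh accumulator."""
    accumulator = fn.create_accumulator()
    for element in bundle:
        accumulator = fn.add_input(accumulator, element)
    return accumulator


def merge_once(fn, bundles):
    """Hypothetical strategy A: merge all partial accumulators in a single call."""
    partials = [accumulate(fn, bundle) for bundle in bundles]
    return fn.extract_output(fn.merge_accumulators(partials))


def merge_incrementally(fn, bundles):
    """Hypothetical strategy B: merge each partial accumulator as it becomes available."""
    merged = fn.create_accumulator()
    for bundle in bundles:
        merged = fn.merge_accumulators([merged, accumulate(fn, bundle)])
    return fn.extract_output(merged)


bundles = [[1, 2], [3], [4, 5, 6]]
once_fn, incremental_fn = MeanFn(), MeanFn()
mean_once = merge_once(once_fn, bundles)
mean_incremental = merge_incrementally(incremental_fn, bundles)
print(mean_once, once_fn.merge_calls)
print(mean_incremental, incremental_fn.merge_calls)
```

Both strategies produce the same mean while calling `merge_accumulators` a different number of times, so a combiner should stay correct regardless of how often, or in what grouping, merging happens.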
