[
https://issues.apache.org/jira/browse/BEAM-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090920#comment-17090920
]
Luke Cwik commented on BEAM-7199:
---------------------------------
The SDF expansion is PairWithRestriction -> InitialSplittingWithSizing ->
Reshuffle -> ProcessSizedElementsAndRestrictions
Implementation here:
https://github.com/apache/beam/blob/ec67a9374671ea9ae670fb0f3935ead2ebed7981/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/graph/SplittableParDoExpander.java#L68
The optimization is that the initial splitting happens during execution, as
data in the pipeline, and the reshuffle enables runners to "redistribute" the
work across multiple workers.
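
To make the four stages concrete, here is a rough pure-Python sketch of the data flow. It does not use the Beam API: the function names, the fixed chunk size, and the round-robin "reshuffle" are assumptions made for illustration only, and the linked SplittableParDoExpander is the authoritative implementation.

```python
from collections import defaultdict


def pair_with_restriction(element):
    # Pair the element with an initial restriction covering all of it,
    # modelled here as the offset range [0, len(element)).
    return element, (0, len(element))


def split_with_sizing(pair, chunk=3):
    # Initial splitting runs as a pipeline step: each sub-restriction is
    # emitted as data, together with a size hint.
    # `chunk` is a hypothetical fixed split size.
    element, (start, stop) = pair
    for lo in range(start, stop, chunk):
        hi = min(lo + chunk, stop)
        yield (element, (lo, hi)), hi - lo


def reshuffle(sized_items, num_workers):
    # Stand-in for Reshuffle: hand items out round-robin so the splits
    # can land on different workers.
    buckets = defaultdict(list)
    for i, item in enumerate(sized_items):
        buckets[i % num_workers].append(item)
    return buckets


def process_sized(item):
    # Process only the part of the element covered by the restriction.
    (element, (lo, hi)), _size = item
    return element[lo:hi]


def run(elements, num_workers=2):
    sized = [s for e in elements
             for s in split_with_sizing(pair_with_restriction(e))]
    buckets = reshuffle(sized, num_workers)
    return {w: [process_sized(it) for it in items]
            for w, items in buckets.items()}


if __name__ == "__main__":
    print(run(["abcdefg"], num_workers=2))
```

Because the splits are ordinary elements by the time they reach the reshuffle, the runner needs no special knowledge of the source to spread the work out.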
The Combiner optimization should be implemented with a similar expansion, like
the SDF one. The expansion is documented here:
https://docs.google.com/document/d/1-3mEs3Y7bIkJ0hmQ6SiHpVIFu5vbY6Zcpw-7tOMVg4U/edit#heading=h.eojkgyq8j323
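
As a rough illustration of what such an expansion buys, here is a pure-Python sketch of a lifted per-key mean. It is not Beam code. The stage split (pre-combine per bundle, group by key, merge accumulators, extract outputs) is my reading of the combiner lifting design and the linked doc is authoritative; all function names here are hypothetical.

```python
from collections import defaultdict


def precombine(bundle, create, add):
    # Runs before the shuffle: fold each bundle's values into one
    # accumulator per key, so less data crosses the GroupByKey.
    accs = {}
    for key, value in bundle:
        accs[key] = add(accs.get(key, create()), value)
    return list(accs.items())


def group_by_key(pairs):
    # Stand-in for GroupByKey: collect all accumulators for each key.
    grouped = defaultdict(list)
    for key, acc in pairs:
        grouped[key].append(acc)
    return grouped


def merge_and_extract(grouped, merge, extract):
    # Runs after the shuffle: merge partial accumulators, then produce
    # the final output for each key.
    return {k: extract(merge(accs)) for k, accs in grouped.items()}


def lifted_mean_per_key(bundles):
    # Accumulator for a mean is (sum, count).
    create = lambda: (0, 0)
    add = lambda acc, v: (acc[0] + v, acc[1] + 1)
    merge = lambda accs: (sum(a[0] for a in accs), sum(a[1] for a in accs))
    extract = lambda acc: acc[0] / acc[1]

    partial = [p for b in bundles for p in precombine(b, create, add)]
    return merge_and_extract(group_by_key(partial), merge, extract)


if __name__ == "__main__":
    bundles = [[("a", 1), ("a", 3), ("b", 10)], [("a", 5), ("b", 20)]]
    print(lifted_mean_per_key(bundles))
```

In this example five input values shrink to four accumulators before the shuffle; with many values per key in each bundle the reduction is much larger.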
> Better optimize Portable pipelines
> ----------------------------------
>
> Key: BEAM-7199
> URL: https://issues.apache.org/jira/browse/BEAM-7199
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Ankur Goenka
> Priority: Major
> Labels: portability
>
> Python has an experimental flag, pre_optimize=all, which pre-optimizes
> Python pipelines by fusing operators.
> Python optimization is expected to be better than the one in Java because it
> has more information about the pipeline.
> Bring Java pipeline optimization on par with Python so that the benefits can
> be shared by all languages.