[
https://issues.apache.org/jira/browse/BEAM-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004274#comment-16004274
]
Etienne Chauchot commented on BEAM-849:
---------------------------------------
Here are the differences I have observed in how streaming pipelines terminate
on each runner:
- direct runner: terminates when the output watermarks of all of its
PCollections progress to +infinity
- apex runner: terminates when the output watermarks of all of its
PCollections progress to +infinity
- dataflow runner: terminates when the output watermarks of all of its
PCollections progress to +infinity
- spark runner: streaming pipelines do not terminate unless a timeout is set in
pipelineResult.waitUntilFinish()
- flink runner: streaming pipelines do not terminate unless a timeout is set in
pipelineResult.waitUntilFinish() (thanks to Aljoscha for the timeout support PR:
https://github.com/apache/beam/pull/2915#pullrequestreview-37090326)
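
To make the spark/flink case concrete, here is a minimal sketch of a bounded
wait. It is a stand-in, not Beam code: the class BoundedWaitSketch, its State
enum, and the CompletableFuture that plays the role of the running job are all
hypothetical, and returning null on timeout is an assumption of this sketch.
Check what your runner actually returns from
pipelineResult.waitUntilFinish() when a timeout is given.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWaitSketch {
    /** Hypothetical terminal states; not the Beam PipelineResult.State enum. */
    enum State { DONE, FAILED }

    /**
     * Waits at most {@code timeout} for the job to reach a terminal state.
     * Returns that state, or null if the timeout elapsed first (assumption).
     */
    static State waitUntilFinish(CompletableFuture<State> job, Duration timeout) {
        try {
            return job.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // still running: a streaming job that never ends on its own
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A future that is never completed models an unbounded streaming job.
        CompletableFuture<State> streamingJob = new CompletableFuture<>();
        State state = waitUntilFinish(streamingJob, Duration.ofMillis(200));
        if (state == null) {
            // The caller, not the runner, decides to stop the job.
            streamingJob.cancel(true);
        }
        System.out.println(state == null ? "timed out, job cancelled" : "finished: " + state);
    }
}
```

The point of the sketch is that with this behavior the decision to stop lives
in the calling code, whereas with the +infinity behavior it lives in the runner.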
=> Is the direct/apex/dataflow behavior the correct "beam model" behavior?
I know that, at least for spark (see the mails in this thread), there is no easy
way to know that we're done reading a source, so it might be very difficult (at
least for this runner) to unify toward the +infinity behavior if it is chosen
as the standard behavior.
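
For reference, the +infinity rule itself is simple to state in code. The
sketch below is only an illustration under stated assumptions: the class
WatermarkTerminationSketch is hypothetical, watermarks are modeled as plain
longs keyed by a made-up PCollection name, and Long.MAX_VALUE stands in for
+infinity. It says nothing about how a given runner tracks watermarks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WatermarkTerminationSketch {
    /** Stand-in for "+infinity"; an assumption of this sketch, not a Beam constant. */
    static final long POSITIVE_INFINITY = Long.MAX_VALUE;

    /** True when every tracked output watermark has reached +infinity. */
    static boolean isTerminated(Map<String, Long> outputWatermarks) {
        // Note: an empty map is treated as terminated (allMatch is vacuously true).
        return outputWatermarks.values().stream()
                .allMatch(w -> w == POSITIVE_INFINITY);
    }

    public static void main(String[] args) {
        Map<String, Long> watermarks = new LinkedHashMap<>();
        watermarks.put("read", POSITIVE_INFINITY);
        watermarks.put("window", 1_000L); // this PCollection is still behind

        System.out.println(isTerminated(watermarks));

        // Once the last PCollection catches up, the pipeline counts as done.
        watermarks.put("window", POSITIVE_INFINITY);
        System.out.println(isTerminated(watermarks));
    }
}
```

The hard part for a runner like spark is not this check but producing the
inputs to it, i.e. knowing that a source has been fully read.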
> Redesign PipelineResult API
> ---------------------------
>
> Key: BEAM-849
> URL: https://issues.apache.org/jira/browse/BEAM-849
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Reporter: Pei He
>
> Current state:
> Jira https://issues.apache.org/jira/browse/BEAM-443 addresses
> waitUntilFinish() and cancel().
> However, there is additional work around PipelineResult:
> - need a clearly defined contract and verification across all runners
> - need to revisit how to handle metrics/aggregators
> - need to be able to get logs