[
https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743343#comment-15743343
]
ASF GitHub Bot commented on FLINK-3257:
---------------------------------------
Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/1668
Thanks for the reminder, I went over the code today. The code looks mostly
good, but here are some thoughts:
- The head task supports only one concurrent checkpoint. In general, the
tasks need to support multiple checkpoints being in progress at the same time.
It frequently happens when people trigger savepoints concurrent to a running
checkpoint. I think that is important to support.
- There tail task offers the elements to the blocking queue. That means
records are simply dropped if the capacity bound queue (one element) is not
polled by the head task in time.
- With the capacity bound in the feedback queue, it is pretty easy to
build a full deadlock. Just use a loop function that explodes data into the
feedback channel.
- Recent code also introduced the ability to change parallelism. What are
the semantics here when the parallelism of the loop is changed?
Since loops did not support any fault tolerance guarantees, I guess this
does improve recovery behavior. But as long as the loops can either deadlock or
drop data, the hard guarantees are in the end still a bit weak. So that leaves
me a bit ambivalent what to do with this pull request.
> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
> -------------------------------------------------------------------
>
> Key: FLINK-3257
> URL: https://issues.apache.org/jira/browse/FLINK-3257
> Project: Flink
> Issue Type: Improvement
> Reporter: Paris Carbone
> Assignee: Paris Carbone
>
> The current snapshotting algorithm cannot support cycles in the execution
> graph. An alternative scheme can potentially include records in-transit
> through the back-edges of a cyclic execution graph (ABS [1]) to achieve the
> same guarantees.
> One straightforward implementation of ABS for cyclic graphs can work as
> follows along the lines:
> 1) Upon triggering a barrier in an IterationHead from the TaskManager start
> block output and start upstream backup of all records forwarded from the
> respective IterationSink.
> 2) The IterationSink should eventually forward the current snapshotting epoch
> barrier to the IterationSource.
> 3) Upon receiving a barrier from the IterationSink, the IterationSource
> should finalize the snapshot, unblock its output and emit all records
> in-transit in FIFO order and continue the usual execution.
> --
> Upon restart the IterationSource should emit all records from the injected
> snapshot first and then continue its usual execution.
> Several optimisations and slight variations can be potentially achieved but
> this can be the initial implementation take.
> [1] http://arxiv.org/abs/1506.08603
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)