[
https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172094#comment-15172094
]
ASF GitHub Bot commented on FLINK-3257:
---------------------------------------
Github user senorcarbone commented on the pull request:
https://github.com/apache/flink/pull/1668#issuecomment-190279441
You can find an alternative version using `ListState` in the following
branch:
https://github.com/senorcarbone/flink/commits/ftloopsalt
So I noticed that this version is quite **slower** than the one with custom
operator state but it can support larger states apparently.
I am (ab)using the PartitionedState to store the ListState in the same key,
as @gyfora suggested since it is the only way to obtain the nice
representations at the moment. It would be nice to have them available for
operator state snapshots as well - @aljoscha have you thought about it? When
there is free time (after the release) it would be nice to see what @aljoscha
and @StephanEwen think of the two takes as well. No hurries, just take a look
when you have time!
The two annoying issues I noticed during testing and we need to check soon
are the following:
- The overhead of transmitting and finally delivering a barrier from the
`head` to its consumers increases in time (for each subsequent checkpoint).
That is due to having a single queue at the beginning of the iterative part of
the job. Events coming from the backedge are pushed further behind the input
queue. It would be nice to have take events in round robin among the two input
gates (iteration source, regular input). Otherwise, checkpoints in iterative
jobs can be really prolonged in time due to this.
- We need a proper way to deal with deadlocks. I removed the part where we
discard events in the tail upon timeout since that boils down to at most once
semantics. This PR is not solving deadlocks but I think we should find a
graceful way to tackle them. (@uce, any ideas? )
> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
> -------------------------------------------------------------------
>
> Key: FLINK-3257
> URL: https://issues.apache.org/jira/browse/FLINK-3257
> Project: Flink
> Issue Type: Improvement
> Reporter: Paris Carbone
> Assignee: Paris Carbone
>
> The current snapshotting algorithm cannot support cycles in the execution
> graph. An alternative scheme can potentially include records in-transit
> through the back-edges of a cyclic execution graph (ABS [1]) to achieve the
> same guarantees.
> One straightforward implementation of ABS for cyclic graphs can work as
> follows along the lines:
> 1) Upon triggering a barrier in an IterationHead from the TaskManager start
> block output and start upstream backup of all records forwarded from the
> respective IterationSink.
> 2) The IterationSink should eventually forward the current snapshotting epoch
> barrier to the IterationSource.
> 3) Upon receiving a barrier from the IterationSink, the IterationSource
> should finalize the snapshot, unblock its output and emit all records
> in-transit in FIFO order and continue the usual execution.
> --
> Upon restart the IterationSource should emit all records from the injected
> snapshot first and then continue its usual execution.
> Several optimisations and slight variations can be potentially achieved but
> this can be the initial implementation take.
> [1] http://arxiv.org/abs/1506.08603
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)