[ 
https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172094#comment-15172094
 ] 

ASF GitHub Bot commented on FLINK-3257:
---------------------------------------

Github user senorcarbone commented on the pull request:

    https://github.com/apache/flink/pull/1668#issuecomment-190279441
  
    You can find an alternative version using `ListState` in the following 
branch:
    https://github.com/senorcarbone/flink/commits/ftloopsalt
    So I noticed that this version is quite **slower** than the one with custom 
operator state but it can support larger states apparently.
    
    I am (ab)using the PartitionedState to store the ListState in the same key, 
as @gyfora suggested since it is the only way to obtain the nice 
representations at the moment. It would be nice to have them available for 
operator state snapshots as well - @aljoscha have you thought about it? When 
there is free time (after the release) it would be nice to see what @aljoscha 
and @StephanEwen think of the two takes as well. No hurries, just take a look 
when you have time!
    
    The two annoying issues I noticed during testing and we need to check soon 
are the following:
    
    - The overhead of transmitting and finally delivering a barrier from the 
`head` to its consumers increases in time (for each subsequent checkpoint). 
That is due to having a single queue at the beginning of the iterative part of 
the job. Events coming from the backedge are pushed further behind the input 
queue.  It would be nice to have take events in round robin among the two input 
gates (iteration source, regular input). Otherwise, checkpoints in iterative 
jobs can be really prolonged in time due to this.
    
    - We need a proper way to deal with deadlocks. I removed the part where we 
discard events in the tail upon timeout since that boils down to at most once 
semantics. This PR is not solving deadlocks but I think we should find a 
graceful way to tackle them. (@uce, any ideas? )


> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
> -------------------------------------------------------------------
>
>                 Key: FLINK-3257
>                 URL: https://issues.apache.org/jira/browse/FLINK-3257
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Paris Carbone
>            Assignee: Paris Carbone
>
> The current snapshotting algorithm cannot support cycles in the execution 
> graph. An alternative scheme can potentially include records in-transit 
> through the back-edges of a cyclic execution graph (ABS [1]) to achieve the 
> same guarantees.
> One straightforward implementation of ABS for cyclic graphs can work as 
> follows along the lines:
> 1) Upon triggering a barrier in an IterationHead from the TaskManager start 
> block output and start upstream backup of all records forwarded from the 
> respective IterationSink.
> 2) The IterationSink should eventually forward the current snapshotting epoch 
> barrier to the IterationSource.
> 3) Upon receiving a barrier from the IterationSink, the IterationSource 
> should finalize the snapshot, unblock its output and emit all records 
> in-transit in FIFO order and continue the usual execution.
> --
> Upon restart the IterationSource should emit all records from the injected 
> snapshot first and then continue its usual execution.
> Several optimisations and slight variations can be potentially achieved but 
> this can be the initial implementation take.
> [1] http://arxiv.org/abs/1506.08603



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to