[ 
https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743343#comment-15743343
 ] 

ASF GitHub Bot commented on FLINK-3257:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/1668
  
    Thanks for the reminder, I went over the code today. The code looks mostly 
good, but here are some thoughts:
    
      - The head task supports only one concurrent checkpoint. In general, the 
tasks need to support multiple checkpoints being in progress at the same time. 
It frequently happens when people trigger savepoints concurrent to a running 
checkpoint. I think that is important to support.
    
      - There tail task offers the elements to the blocking queue. That means 
records are simply dropped if the capacity bound queue (one element) is not 
polled by the head task in time.
    
      - With the capacity bound in the feedback queue, it is pretty easy to 
build a full deadlock. Just use a loop function that explodes data into the 
feedback channel.
    
      - Recent code also introduced the ability to change parallelism. What are 
the semantics here when the parallelism of the loop is changed?
    
    Since loops did not support any fault tolerance guarantees, I guess this 
does improve recovery behavior. But as long as the loops can either deadlock or 
drop data, the hard guarantees are in the end still a bit weak. So that leaves 
me a bit ambivalent what to do with this pull request.



> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
> -------------------------------------------------------------------
>
>                 Key: FLINK-3257
>                 URL: https://issues.apache.org/jira/browse/FLINK-3257
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Paris Carbone
>            Assignee: Paris Carbone
>
> The current snapshotting algorithm cannot support cycles in the execution 
> graph. An alternative scheme can potentially include records in-transit 
> through the back-edges of a cyclic execution graph (ABS [1]) to achieve the 
> same guarantees.
> One straightforward implementation of ABS for cyclic graphs can work as 
> follows along the lines:
> 1) Upon triggering a barrier in an IterationHead from the TaskManager start 
> block output and start upstream backup of all records forwarded from the 
> respective IterationSink.
> 2) The IterationSink should eventually forward the current snapshotting epoch 
> barrier to the IterationSource.
> 3) Upon receiving a barrier from the IterationSink, the IterationSource 
> should finalize the snapshot, unblock its output and emit all records 
> in-transit in FIFO order and continue the usual execution.
> --
> Upon restart the IterationSource should emit all records from the injected 
> snapshot first and then continue its usual execution.
> Several optimisations and slight variations can be potentially achieved but 
> this can be the initial implementation take.
> [1] http://arxiv.org/abs/1506.08603



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to