[ 
https://issues.apache.org/jira/browse/FLINK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115129#comment-15115129
 ] 

ASF GitHub Bot commented on FLINK-3261:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1537#issuecomment-174491828
  
    Good fix.
    
    I think that, to be on the super safe side, all tasks should always report
    back to the JobManager that they received the "TriggerCheckpoint" message
    (either Ack or Decline).
    
    The CheckpointCoordinator would then "ask" the TaskManagers and cancel the
    checkpoint if any of the asks time out (after 3-5 seconds or so). That way,
    lost messages are properly accounted for.


> Tasks should eagerly report back when they cannot start a checkpoint
> --------------------------------------------------------------------
>
>                 Key: FLINK-3261
>                 URL: https://issues.apache.org/jira/browse/FLINK-3261
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 0.10.1
>            Reporter: Stephan Ewen
>            Assignee: Aljoscha Krettek
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> With very fast checkpoint intervals (a few hundred milliseconds), it can 
> happen that a Task is not ready to start a checkpoint by the time it gets 
> the first checkpoint trigger message.
> If some other tasks are ready already and commence a checkpoint, the stream 
> alignment will make the non-participating task wait until the checkpoint 
> expires (default: 10 minutes).
> A simple way to fix this is for tasks to report back when they cannot start 
> a checkpoint. The checkpoint coordinator can then abort that checkpoint and 
> unblock the streams by starting a new checkpoint (in which all tasks will 
> participate).
> An optimization would be to send a special "abort checkpoints barrier" that 
> tells the barrier buffers for stream alignment to unblock a checkpoint.
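
The simple fix described in the issue can be sketched in Java roughly as follows. This is a hedged illustration only: `DeclineAwareCoordinator`, `SketchTask`, and all method names are hypothetical and do not reflect the change made in the pull request. The sketch models a single checkpoint in flight and omits barriers and state snapshots.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical coordinator sketch; not Flink's CheckpointCoordinator. */
class DeclineAwareCoordinator {
    private long nextCheckpointId = 1;
    private long pendingCheckpointId = -1; // -1 means no checkpoint in flight
    final List<String> events = new ArrayList<>();

    /** Starts a new checkpoint and returns its ID. */
    long triggerCheckpoint() {
        pendingCheckpointId = nextCheckpointId++;
        events.add("trigger " + pendingCheckpointId);
        return pendingCheckpointId;
    }

    /** Called by a task that could not start the given checkpoint. */
    void declineCheckpoint(long checkpointId, String taskName) {
        if (checkpointId != pendingCheckpointId) {
            return; // stale decline for a checkpoint that is no longer pending
        }
        events.add("abort " + checkpointId + ", declined by " + taskName);
        // Starting a new checkpoint is what unblocks the aligned streams.
        triggerCheckpoint();
    }

    long pendingCheckpointId() {
        return pendingCheckpointId;
    }
}

/** Hypothetical task sketch that reports back eagerly when it is not ready. */
class SketchTask {
    private final String name;
    private boolean ready; // false until the task has finished starting up

    SketchTask(String name) {
        this.name = name;
    }

    void markReady() {
        ready = true;
    }

    /** Returns true if the task started the checkpoint, false if it declined. */
    boolean onTriggerCheckpoint(long checkpointId, DeclineAwareCoordinator coordinator) {
        if (!ready) {
            coordinator.declineCheckpoint(checkpointId, name);
            return false;
        }
        return true; // a real task would emit barriers and snapshot state here
    }
}
```

In this sketch, a task that receives the trigger before it is ready declines right away, so the coordinator aborts the checkpoint immediately and issues the next one instead of leaving the other tasks blocked in alignment until the checkpoint expires.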



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
