[
https://issues.apache.org/jira/browse/FLINK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115155#comment-15115155
]
ASF GitHub Bot commented on FLINK-3261:
---------------------------------------
Github user senorcarbone commented on the pull request:
https://github.com/apache/flink/pull/1537#issuecomment-174500162
Looks cool.
Just so I understand exactly: what goes wrong if the Coordinator simply
aborts expired checkpoint attempts? Wouldn't the protocol be the same, with
fewer messages? If a task is not ready, it can simply discard the checkpoint
request, which will eventually time out at the Coordinator. The Coordinator's
attempts might keep timing out, but there will eventually be a complete
snapshot once all tasks are ready.
> Tasks should eagerly report back when they cannot start a checkpoint
> --------------------------------------------------------------------
>
> Key: FLINK-3261
> URL: https://issues.apache.org/jira/browse/FLINK-3261
> Project: Flink
> Issue Type: Bug
> Components: Distributed Runtime
> Affects Versions: 0.10.1
> Reporter: Stephan Ewen
> Assignee: Aljoscha Krettek
> Priority: Blocker
> Fix For: 1.0.0
>
>
> With very short checkpoint intervals (a few hundred milliseconds), a Task
> may not be ready to start a checkpoint by the time it receives the first
> checkpoint trigger message.
> If some other tasks are ready already and commence a checkpoint, the stream
> alignment will make the non-participating task wait until the checkpoint
> expires (default: 10 minutes).
> A simple fix is for tasks to report back when they cannot start a
> checkpoint. The checkpoint coordinator can then abort that checkpoint and
> unblock the streams by starting a new checkpoint (in which all tasks will
> participate).
> An optimization would be to send a special "abort checkpoints barrier" that
> tells the barrier buffers for stream alignment to unblock a checkpoint.
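
The "report back and abort" idea from the description can be sketched roughly
as follows. This is a minimal illustration only, not Flink's actual
CheckpointCoordinator: the class name DeclineAwareCoordinator and the methods
triggerCheckpoint, acknowledge, and decline are hypothetical, and the sketch
assumes a single-threaded coordinator with no timeouts and no real messaging.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a coordinator that aborts a checkpoint as soon as a task declines it. */
class DeclineAwareCoordinator {
    private final Set<String> tasks;
    // Checkpoint id -> tasks that have not acknowledged it yet.
    private final Map<Long, Set<String>> pendingAcks = new HashMap<>();
    private long nextCheckpointId = 1;
    private long lastCompleted = -1;

    DeclineAwareCoordinator(String... taskNames) {
        this.tasks = new HashSet<>(Arrays.asList(taskNames));
    }

    /** Starts a new checkpoint that waits for an acknowledgement from every task. */
    long triggerCheckpoint() {
        long id = nextCheckpointId++;
        pendingAcks.put(id, new HashSet<>(tasks));
        return id;
    }

    /** Records an acknowledgement; the checkpoint completes once all tasks have acknowledged. */
    void acknowledge(String task, long checkpointId) {
        Set<String> waiting = pendingAcks.get(checkpointId);
        if (waiting == null) {
            return; // already aborted or completed: ignore the late message
        }
        waiting.remove(task);
        if (waiting.isEmpty()) {
            pendingAcks.remove(checkpointId);
            lastCompleted = checkpointId;
        }
    }

    /**
     * A task reports that it could not start the checkpoint. The checkpoint is
     * aborted immediately instead of waiting for it to expire, and a new one is
     * triggered. Returns the new checkpoint id, or -1 if the declined
     * checkpoint was no longer pending.
     */
    long decline(String task, long checkpointId) {
        if (pendingAcks.remove(checkpointId) == null) {
            return -1;
        }
        return triggerCheckpoint();
    }

    boolean isPending(long checkpointId) {
        return pendingAcks.containsKey(checkpointId);
    }

    long lastCompletedCheckpoint() {
        return lastCompleted;
    }
}
```

In this sketch, a call such as decline("map", id) drops the pending checkpoint
right away and returns the id of its replacement, so nothing has to wait for
the expiry timeout. Whether the replacement should be triggered immediately or
after a delay is left open here.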
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)