[ https://issues.apache.org/jira/browse/FLINK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139418#comment-17139418 ]
Roman Khachatryan commented on FLINK-18336:
-------------------------------------------
> This means a failure for a checkpoint that is already subsumed by a newer
> checkpoint (higher checkpoint ID) ?
Yes. The other case is the decline of an older savepoint (savepoints, I
believe, are not subsumed).
> In that case, should the failure manager maintain something like "latest
> successful checkpoint" and ignore all failures / messages from older
> (subsumed) checkpoint attempts?
That should work, and it's quite simple and efficient.
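
A minimal sketch of that idea, for illustration only. The class
SubsumptionAwareFailureCounter and its method names are hypothetical and are
not part of Flink's CheckpointFailureManager. It assumes that checkpoint IDs
increase monotonically and that the job should fail once the number of counted
failures exceeds the tolerable number. It does not cover the savepoint case
mentioned above.

```java
/**
 * Hypothetical sketch (not Flink code): remembers the latest successful
 * checkpoint ID and ignores failures reported for that checkpoint or any
 * older one.
 */
class SubsumptionAwareFailureCounter {
    private final int tolerableFailures;
    private final Runnable failJobCallback;

    // Assumption: real checkpoint IDs are positive, so -1 means "none yet".
    private long latestSuccessfulCheckpointId = -1L;
    private int countedFailures = 0;

    SubsumptionAwareFailureCounter(int tolerableFailures, Runnable failJobCallback) {
        this.tolerableFailures = tolerableFailures;
        this.failJobCallback = failJobCallback;
    }

    void handleFailure(long checkpointId) {
        // A failure for a subsumed (older or equal) checkpoint is ignored.
        if (checkpointId <= latestSuccessfulCheckpointId) {
            return;
        }
        countedFailures++;
        if (countedFailures > tolerableFailures) {
            failJobCallback.run();
        }
    }

    void handleSuccess(long checkpointId) {
        // Only a newer success advances the marker and resets the counter.
        if (checkpointId > latestSuccessfulCheckpointId) {
            latestSuccessfulCheckpointId = checkpointId;
            countedFailures = 0;
        }
    }

    int getCountedFailures() {
        return countedFailures;
    }
}
```

Replaying the sequence from the issue description (failures for 1 and 2,
success for 2, failures for 3 and 4, then a late failure for 1) against this
sketch with a tolerable number of 2 leaves the callback uninvoked, because the
late failure for checkpoint 1 is dropped.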
> CheckpointFailureManager forgets failed checkpoints after a successful one
> --------------------------------------------------------------------------
>
> Key: FLINK-18336
> URL: https://issues.apache.org/jira/browse/FLINK-18336
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Reporter: Roman Khachatryan
> Assignee: Roman Khachatryan
> Priority: Major
> Labels: pull-request-available
>
> To my understanding, a failure shouldn't be counted more than once for a
> single checkpoint.
> However, after a successful checkpoint, all previous failures are cleared,
> so a repeated failure report for an already-failed checkpoint is counted again.
> As a result, this test currently fails:
>
> {code:java}
> TestFailJobCallback callback = new TestFailJobCallback();
> CheckpointFailureManager failureManager = new CheckpointFailureManager(2, callback);
> failureManager.handleJobLevelCheckpointException(new CheckpointException(CHECKPOINT_EXPIRED), 1L);
> failureManager.handleJobLevelCheckpointException(new CheckpointException(CHECKPOINT_EXPIRED), 2L);
> failureManager.handleCheckpointSuccess(2L);
> failureManager.handleJobLevelCheckpointException(new CheckpointException(CHECKPOINT_EXPIRED), 3L);
> failureManager.handleJobLevelCheckpointException(new CheckpointException(CHECKPOINT_EXPIRED), 4L);
> // shouldn't be counted because 1L has already failed:
> failureManager.handleJobLevelCheckpointException(new CheckpointException(CHECKPOINT_EXPIRED), 1L);
> assertEquals(0, callback.getInvokeCounter());
> {code}
>