Congxian Qiu(klion26) created FLINK-13861:
---------------------------------------------
Summary: No new checkpoint triggered when canceling an expired
checkpoint failed
Key: FLINK-13861
URL: https://issues.apache.org/jira/browse/FLINK-13861
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.9.0, 1.8.1, 1.7.2
Reporter: Congxian Qiu(klion26)
Fix For: 1.10.0
I encountered this problem in our private fork of Flink. After looking at
the current master branch of Apache Flink, I think the problem exists there as well.
Problem Detail:
1. When a checkpoint expires, it is canceled by the canceller shown below:
{code:java}
final Runnable canceller = () -> {
    synchronized (lock) {
        // only do the work if the checkpoint is not discarded anyways
        // note that checkpoint completion discards the pending checkpoint object
        if (!checkpoint.isDiscarded()) {
            LOG.info("Checkpoint {} of job {} expired before completing.",
                checkpointID, job);
            failPendingCheckpoint(checkpoint,
                CheckpointFailureReason.CHECKPOINT_EXPIRED);
            pendingCheckpoints.remove(checkpointID);
            rememberRecentCheckpointId(checkpointID);
            triggerQueuedRequests();
        }
    }
};
{code}
However, {{failPendingCheckpoint}} may throw an exception, because the call chain is
{{CheckpointCoordinator#failPendingCheckpoint}}
-> {{PendingCheckpoint#abort}}
-> {{PendingCheckpoint#reportFailedCheckpoint}}
-> construction of a {{FailedCheckpointStats}}, whose {{checkArgument}} may throw
an exception.
If it throws, the remaining statements in the canceller
({{pendingCheckpoints.remove}}, {{rememberRecentCheckpointId}} and
{{triggerQueuedRequests}}) are skipped.
I have not yet found out why the {{checkArgument}} failed (the problem does not
reproduce frequently); I will create a separate issue for that if I have more
findings.
2. The next time a checkpoint is triggered, we first check whether there are
already too many pending checkpoints, as shown below:
{code:java}
private void checkConcurrentCheckpoints() throws CheckpointException {
    if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
        triggerRequestQueued = true;
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel(false);
            currentPeriodicTrigger = null;
        }
        throw new CheckpointException(CheckpointFailureReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
    }
}
{code}
Because the expired checkpoint was never removed from {{pendingCheckpoints}}, the
condition {{pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts}} will
always be true.
3. No checkpoint will ever be triggered from then on.
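The three steps above can be reproduced with a toy model. This is only a sketch, not Flink code: the class {{StalledCoordinatorSketch}}, its methods, and the limit of one concurrent checkpoint are hypothetical stand-ins chosen for illustration, and the thrown {{IllegalArgumentException}} simulates the {{checkArgument}} failure.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the stall; not Flink code. All names here are hypothetical. */
class StalledCoordinatorSketch {
    private final Map<Long, String> pendingCheckpoints = new HashMap<>();
    // Assumption for the demo: only one checkpoint may be pending at a time.
    private final int maxConcurrentCheckpointAttempts = 1;
    private long nextId = 1;

    /** Mirrors the size check in checkConcurrentCheckpoints(). */
    boolean tryTrigger() {
        if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
            return false; // stands in for TOO_MANY_CONCURRENT_CHECKPOINTS
        }
        pendingCheckpoints.put(nextId++, "pending");
        return true;
    }

    /** Mirrors the canceller: the failure step throws, so remove() is never reached. */
    void expire(long checkpointId) {
        failPendingCheckpoint(checkpointId);
        pendingCheckpoints.remove(checkpointId);
    }

    /** Stand-in for the call chain that ends in a failing checkArgument. */
    private void failPendingCheckpoint(long checkpointId) {
        throw new IllegalArgumentException(
            "simulated checkArgument failure for checkpoint " + checkpointId);
    }

    int pendingCount() {
        return pendingCheckpoints.size();
    }

    public static void main(String[] args) {
        StalledCoordinatorSketch coordinator = new StalledCoordinatorSketch();
        System.out.println("first trigger accepted: " + coordinator.tryTrigger());
        try {
            coordinator.expire(1L);
        } catch (IllegalArgumentException e) {
            System.out.println("canceller failed: " + e.getMessage());
        }
        System.out.println("pending after failed expiry: " + coordinator.pendingCount());
        System.out.println("next trigger accepted: " + coordinator.tryTrigger());
    }
}
```

In this model the expired entry stays in the map after the failed expiry, so every later call to {{tryTrigger}} is rejected.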
Because {{failPendingCheckpoint}} may throw an exception, we could move the
logic that removes the pending checkpoint into a {{finally}} block.
I'd like to file a PR for this if it really needs to be fixed.
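A minimal sketch of what the {{finally}} approach might look like, written as a standalone toy class rather than a patch against {{CheckpointCoordinator}}. The class {{CancellerFinallySketch}}, the {{failureStepThrows}} flag, and the counter are hypothetical. Running {{triggerQueuedRequests}} inside the {{finally}} block is an assumption of this sketch, not something the proposal above states, and {{rememberRecentCheckpointId}} is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy canceller with cleanup in a finally block; not Flink code, names are hypothetical. */
class CancellerFinallySketch {
    private final Object lock = new Object();
    private final Map<Long, String> pendingCheckpoints = new HashMap<>();
    private int queuedTriggerRuns = 0;

    void register(long checkpointId) {
        synchronized (lock) {
            pendingCheckpoints.put(checkpointId, "pending");
        }
    }

    /** Builds a canceller; failureStepThrows simulates the checkArgument failure. */
    Runnable cancellerFor(long checkpointId, boolean failureStepThrows) {
        return () -> {
            synchronized (lock) {
                if (!pendingCheckpoints.containsKey(checkpointId)) {
                    return; // already completed or discarded
                }
                try {
                    failPendingCheckpoint(checkpointId, failureStepThrows);
                } finally {
                    // Runs even if the failure step throws, so the slot is always freed.
                    pendingCheckpoints.remove(checkpointId);
                    // Assumption: queued triggers are also resumed on the error path.
                    triggerQueuedRequests();
                }
            }
        };
    }

    private void failPendingCheckpoint(long checkpointId, boolean shouldThrow) {
        if (shouldThrow) {
            throw new IllegalArgumentException(
                "simulated checkArgument failure for checkpoint " + checkpointId);
        }
    }

    private void triggerQueuedRequests() {
        queuedTriggerRuns++;
    }

    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }

    int queuedTriggerRuns() {
        synchronized (lock) {
            return queuedTriggerRuns;
        }
    }

    public static void main(String[] args) {
        CancellerFinallySketch sketch = new CancellerFinallySketch();
        sketch.register(1L);
        try {
            sketch.cancellerFor(1L, true).run();
        } catch (IllegalArgumentException e) {
            System.out.println("failure still propagates: " + e.getMessage());
        }
        System.out.println("pending after expiry: " + sketch.pendingCount());
    }
}
```

The exception still propagates to the caller, so the underlying {{checkArgument}} failure remains visible in the logs; only the bookkeeping is made unconditional.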