[
https://issues.apache.org/jira/browse/FLINK-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yu Li updated FLINK-13593:
--------------------------
Summary: Prevent failing the wrong execution attempt in
CheckpointFailureManager (was: Prevent failing the wrong job in
CheckpointFailureManager)
> Prevent failing the wrong execution attempt in CheckpointFailureManager
> -----------------------------------------------------------------------
>
> Key: FLINK-13593
> URL: https://issues.apache.org/jira/browse/FLINK-13593
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.9.0
> Reporter: Yu Li
> Assignee: Yu Li
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Because checkpoint decline messages are handled asynchronously in
> {{LegacyScheduler#declineCheckpoint}}, a message may be handled before the
> job status transition. In that case {{receiveDeclineMessage}} grabs the lock
> in {{CheckpointCoordinator}} before {{pendingCheckpoints}} is cleared by
> {{stopCheckpointScheduler}} (which is triggered by the job status listener
> {{CheckpointCoordinatorDeActivator}}). If the job/tasks then restart quickly
> enough, the {{FailJobCallback}} in {{CheckpointFailureManager}} might
> unexpectedly fail the job again, as observed in FLINK-13527.
> To resolve the issue, we need to add a safeguard when failing the job:
> pass the {{ExecutionAttemptID}} through and check it against the current
> executions to make sure the to-be-failed attempt is still running, so that
> we do not fail the newly restarted one by accident.
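
The safeguard described above can be illustrated with a minimal, self-contained sketch. This is not the actual Flink patch: the class AttemptGuardSketch, its methods, and the use of a plain String in place of ExecutionAttemptID are all assumptions made for illustration. The idea shown is only that a failure request carries the ID of the attempt it refers to, and is ignored if that attempt is no longer the current, running one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; none of these types are real Flink classes. */
final class AttemptGuardSketch {

    /** Hypothetical stand-in for an execution attempt's lifecycle state. */
    enum State { RUNNING, FAILED }

    /** Hypothetical registry of current attempts, keyed by a String attempt ID. */
    private final Map<String, State> currentAttempts = new ConcurrentHashMap<>();

    /** Registers a new attempt as running. */
    void register(String attemptId) {
        currentAttempts.put(attemptId, State.RUNNING);
    }

    /** Simulates a restart: the old attempt is dropped and a new one is registered. */
    void restart(String oldAttemptId, String newAttemptId) {
        currentAttempts.remove(oldAttemptId);
        register(newAttemptId);
    }

    /**
     * Fails the given attempt only if it is still current and running.
     * Returns true if the attempt was failed, false if the request was stale
     * (for example, a late decline message that refers to an attempt that
     * has already been replaced by a restart).
     */
    boolean failIfStillRunning(String attemptId) {
        // replace(key, expected, updated) is an atomic compare-and-set on
        // ConcurrentHashMap; it returns false for unknown or non-running attempts.
        return currentAttempts.replace(attemptId, State.RUNNING, State.FAILED);
    }

    /** Returns the state of an attempt, or null if it is not a current attempt. */
    State stateOf(String attemptId) {
        return currentAttempts.get(attemptId);
    }
}
```

In this sketch, a late decline message for "attempt-1" that arrives after restart("attempt-1", "attempt-2") makes failIfStillRunning("attempt-1") return false, and "attempt-2" stays RUNNING. The check and the state change happen in one atomic step so that a restart cannot slip in between them.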