[jira] [Updated] (FLINK-13593) Prevent failing the wrong execution attempt in CheckpointFailureManager

Yu Li (JIRA) Tue, 06 Aug 2019 01:18:15 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yu Li updated FLINK-13593:
--------------------------
    Summary: Prevent failing the wrong execution attempt in 
CheckpointFailureManager  (was: Prevent failing the wrong job in 
CheckpointFailureManager)

> Prevent failing the wrong execution attempt in CheckpointFailureManager
> -----------------------------------------------------------------------
>
>                 Key: FLINK-13593
>                 URL: https://issues.apache.org/jira/browse/FLINK-13593
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Due to the asynchronous handling of the checkpoint decline message in 
> {{LegacyScheduler#declineCheckpoint}}, the message may be handled before 
> the job status transition, so {{receiveDeclineMessage}} can grab the lock 
> in {{CheckpointCoordinator}} before {{pendingCheckpoints}} is cleared by 
> {{stopCheckpointScheduler}} (triggered by the job status listener 
> {{CheckpointCoordinatorDeActivator}}). If the job/tasks then restart 
> quickly enough, the {{FailJobCallback}} in {{CheckpointFailureManager}} might 
> unexpectedly fail the job again, as observed in FLINK-13527.
> To resolve the issue, we need to add a safeguard when failing the job: 
> pass the {{ExecutionAttemptID}} through and check it against the current 
> executions to make sure the to-be-failed one is still running, so we won't 
> fail the newly restarted one by accident.
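
The safeguard described above could look roughly like the sketch below. It is
not Flink's actual code: the class AttemptGuardSketch, its methods, and the
use of java.util.UUID in place of ExecutionAttemptID are all hypothetical
stand-ins. The only idea taken from the issue is that the failure request
carries the attempt ID and is applied only if that attempt is still current
and running.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the scheduler/failure-manager interaction.
// UUID plays the role of ExecutionAttemptID; nothing here is Flink API.
final class AttemptGuardSketch {

    enum State { RUNNING, FAILED }

    // Attempt ID -> state of every attempt that is currently tracked.
    private final Map<UUID, State> currentAttempts = new ConcurrentHashMap<>();

    // Registers a new running attempt and returns its ID.
    UUID startAttempt() {
        UUID id = UUID.randomUUID();
        currentAttempts.put(id, State.RUNNING);
        return id;
    }

    // Simulates a restart: the old attempt is dropped and a new one begins.
    UUID restart(UUID oldAttempt) {
        currentAttempts.remove(oldAttempt);
        return startAttempt();
    }

    // Fails the given attempt only if it is still tracked and RUNNING.
    // The three-argument replace is atomic, so a late decline message for
    // an attempt that has already been replaced changes nothing.
    boolean failIfStillRunning(UUID attemptId) {
        return currentAttempts.replace(attemptId, State.RUNNING, State.FAILED);
    }

    // Returns the tracked state, or null if the attempt is no longer current.
    State stateOf(UUID attemptId) {
        return currentAttempts.get(attemptId);
    }

    public static void main(String[] args) {
        AttemptGuardSketch guard = new AttemptGuardSketch();
        UUID first = guard.startAttempt();
        UUID second = guard.restart(first);

        // A decline that arrives late still names the first attempt.
        System.out.println("stale decline applied: " + guard.failIfStillRunning(first));
        // A decline for the current attempt is applied.
        System.out.println("current decline applied: " + guard.failIfStillRunning(second));
    }
}
```

Keying the check on the attempt ID rather than on the job means a callback
created for an earlier attempt cannot affect the attempt that replaced it,
no matter how late the callback runs.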



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
