Keith Lee created FLINK-37808:
---------------------------------

             Summary: Checkpoint completed after job failure 
                 Key: FLINK-37808
                 URL: https://issues.apache.org/jira/browse/FLINK-37808
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.18.0
            Reporter: Keith Lee


We found a case where a checkpoint was marked as completed after the job had 
failed (due to loss of leadership). The checkpoint was subsequently used for 
automatic recovery. Is this by design? Could it cause issues in jobs with 
two-phase-commit sinks?

1. Checkpoint was triggered.

```
2025-04-09T10:26:31.077Z   Triggering checkpoint 3270594 
(type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) 
@ 1744194390986 for job REDACTED.
```

2. JobManager lost leadership.

```
2025-04-09T10:26:32.868Z   Closing TaskExecutor connection 
10.99.68.36:6122-db3d6b because: ResourceManager leader changed to new address 
null
...
2025-04-09T10:26:33.940Z   Disconnect TaskExecutor 10.99.68.36:6122-db3d6b 
because: Job leader for job id REDACTED lost leadership.
```

3. Job failed and began restarting.

```
2025-04-09T10:26:33.982Z   Job Flink Streaming Job (REDACTED) switched from 
state RUNNING to RESTARTING.
```

4. Checkpoint 3270594 was unexpectedly marked as completed instead of failed.

```
2025-04-09T10:26:34.719Z   Completed checkpoint 3270594 for job REDACTED 
(346358222 bytes, checkpointDuration=2605 ms, finalizationTime=1127 ms).
```

5. Job was then restored from the checkpoint that should have failed.

```
2025-04-09T10:26:44.880Z Restoring job REDACTED from Checkpoint 3270594 
```
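
As a rough cross-check of the timeline above, the sketch below works only from 
the timestamps and durations printed in the logs. It assumes that the 
"Completed checkpoint" line is logged at the moment finalization ends, so that 
subtracting finalizationTime gives the moment finalization began. That is an 
assumption about the log semantics, not something verified against the Flink 
source. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for reasoning about the log excerpts in this report.
// It does not use any Flink API; it only does timestamp arithmetic.
class CheckpointTimeline {

    // Milliseconds elapsed from one ISO-8601 instant to another.
    static long millisBetween(String from, String to) {
        return Duration.between(Instant.parse(from), Instant.parse(to)).toMillis();
    }

    // ASSUMPTION: the "Completed checkpoint" line is logged when finalization
    // ends, so finalization began finalizationMillis earlier.
    static Instant finalizationStart(String completedAt, long finalizationMillis) {
        return Instant.parse(completedAt).minusMillis(finalizationMillis);
    }

    public static void main(String[] args) {
        // Values copied from the log lines in steps 1, 3 and 4.
        Instant triggered = Instant.ofEpochMilli(1744194390986L);
        String restarting = "2025-04-09T10:26:33.982Z";
        String completed = "2025-04-09T10:26:34.719Z";
        long finalizationMillis = 1127L;

        Instant finalizing = finalizationStart(completed, finalizationMillis);

        System.out.println("checkpoint trigger timestamp: " + triggered);
        System.out.println("finalization began (assumed): " + finalizing);
        System.out.println("ms from finalization start to RESTARTING: "
                + millisBetween(finalizing.toString(), restarting));
        System.out.println("ms from RESTARTING to completion log: "
                + millisBetween(restarting, completed));
    }
}
```

Under that assumption, finalization began at 10:26:33.592Z, roughly 390 ms 
before the job switched to RESTARTING, and the completion was logged 737 ms 
after the switch. So the checkpoint may already have been in its finalization 
phase when the job failed, which might be why it was not aborted.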



--
This message was sent by Atlassian Jira
(v8.20.10#820010)