Keith Lee created FLINK-37808:
---------------------------------

             Summary: Checkpoint completed after job failure
                 Key: FLINK-37808
                 URL: https://issues.apache.org/jira/browse/FLINK-37808
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.18.0
            Reporter: Keith Lee

We found a case where a checkpoint was marked as completed after the job had failed (due to loss of leadership). The checkpoint was subsequently used for automatic recovery. Is this by design? Could it have caused issues in jobs with two-phase commit sinks?

1. The checkpoint was triggered.

```
2025-04-09T10:26:31.077Z Triggering checkpoint 3270594 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1744194390986 for job REDACTED.
```

2. The JobManager lost leadership.

```
2025-04-09T10:26:32.868Z Closing TaskExecutor connection 10.99.68.36:6122-db3d6b because: ResourceManager leader changed to new address null
...
2025-04-09T10:26:33.940Z Disconnect TaskExecutor 10.99.68.36:6122-db3d6b because: Job leader for job id REDACTED lost leadership.
```

3. The job failed and started restarting.

```
2025-04-09T10:26:33.982Z Job Flink Streaming Job (REDACTED) switched from state RUNNING to RESTARTING.
```

4. Checkpoint 3270594 was unexpectedly marked as completed instead of failed.

```
2025-04-09T10:26:34.719Z Completed checkpoint 3270594 for job REDACTED (346358222 bytes, checkpointDuration=2605 ms, finalizationTime=1127 ms).
```

5. The job was then restored from the checkpoint that should have failed.

```
2025-04-09T10:26:44.880Z Restoring job REDACTED from Checkpoint 3270594
```
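
For anyone who wants to check their own JobManager logs for the same ordering (checkpoint triggered, job switched to RESTARTING, then the same checkpoint reported as completed), here is a rough Python sketch. It is not part of Flink. The function names `parse_ts` and `late_completions` are hypothetical, and the regular expressions assume the log wording and timestamp format quoted in this report, which may differ in other versions or logging configurations.

```python
import re
from datetime import datetime

# Patterns are taken from the log messages quoted in this report (assumption:
# other Flink versions or log layouts may word these lines differently).
TRIGGER = re.compile(r"Triggering checkpoint (\d+)")
COMPLETE = re.compile(r"Completed checkpoint (\d+)")
RESTART = re.compile(r"switched from state RUNNING to RESTARTING")


def parse_ts(line):
    """Parse a leading timestamp such as 2025-04-09T10:26:31.077Z."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%fZ")


def late_completions(lines):
    """Return IDs of checkpoints triggered before a RUNNING -> RESTARTING
    transition but reported as completed at or after it."""
    triggered = {}       # checkpoint id -> trigger time
    last_restart = None  # time of the most recent RESTARTING transition
    suspicious = []
    for line in lines:
        try:
            ts = parse_ts(line)
        except ValueError:
            continue  # skip lines without a leading timestamp, e.g. "..."
        if m := TRIGGER.search(line):
            triggered[int(m.group(1))] = ts
        elif RESTART.search(line):
            last_restart = ts
        elif m := COMPLETE.search(line):
            cid = int(m.group(1))
            start = triggered.get(cid)
            if start is not None and last_restart is not None \
                    and start < last_restart <= ts:
                suspicious.append(cid)
    return suspicious


if __name__ == "__main__":
    sample = [
        "2025-04-09T10:26:31.077Z Triggering checkpoint 3270594 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1744194390986 for job REDACTED.",
        "...",
        "2025-04-09T10:26:33.982Z Job Flink Streaming Job (REDACTED) switched from state RUNNING to RESTARTING.",
        "2025-04-09T10:26:34.719Z Completed checkpoint 3270594 for job REDACTED (346358222 bytes, checkpointDuration=2605 ms, finalizationTime=1127 ms).",
    ]
    print(late_completions(sample))
```

The sketch only compares timestamps within a single log stream. It does not tell whether the completion was legitimate from the coordinator's point of view, so a hit is a candidate for investigation rather than proof of the bug.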