Hi devs, I am a new contributor to Flink and would like to resolve https://issues.apache.org/jira/browse/FLINK-28398, which has been open since 2022.
I can consistently reproduce the blocked thread locally on the latest master by setting MINIMAL_CHECKPOINT_TIME = 1L.

Findings: The issue occurs when the timeout canceller aborts the checkpoint before TestingMasterHook is triggered. In that case triggerCheckpointLatch is never released, so the thread stays blocked on await().

Proposed fix: Adding a whenComplete callback to the firstCheckpoint future that manually releases the latch resolves the blocked state.

I'd like to discuss whether this is the preferred approach and whether I should move forward with a PR. As the ticket is currently unassigned, I'm happy to take it up.

Warm regards,
Souptik
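
P.S. To illustrate the idea outside of Flink, here is a minimal standalone sketch. It is not the actual test code: the class LatchReleaseSketch, the method awaitAfterAbort, and the variables triggerLatch and checkpointFuture are placeholder names, and the abort by the timeout canceller is only simulated by completing the future exceptionally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test scenario; none of these names come from Flink.
final class LatchReleaseSketch {

    /** Returns true if a thread waiting on the latch would be unblocked within 200 ms. */
    static boolean awaitAfterAbort(boolean releaseOnCompletion) {
        CountDownLatch triggerLatch = new CountDownLatch(1);
        CompletableFuture<String> checkpointFuture = new CompletableFuture<>();

        if (releaseOnCompletion) {
            // Release the latch however the future ends, normally or exceptionally.
            checkpointFuture.whenComplete((result, error) -> triggerLatch.countDown());
        }

        // Simulate the abort happening before the hook ever counts the latch down.
        checkpointFuture.completeExceptionally(new RuntimeException("checkpoint expired"));

        try {
            // A bounded wait is used here so the sketch terminates in both cases.
            return triggerLatch.await(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without callback: " + awaitAfterAbort(false));
        System.out.println("with callback: " + awaitAfterAbort(true));
    }
}
```

Without the callback the wait times out, which corresponds to the indefinite block on await() in the test. With the callback the latch is released as soon as the future completes, regardless of whether the hook ran.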
