Hi devs,

I am a new contributor to Flink and would like to resolve
https://issues.apache.org/jira/browse/FLINK-28398, which has been open
since 2022.

I can consistently reproduce the blocked thread locally on the latest
master by setting MINIMAL_CHECKPOINT_TIME = 1L.

Findings:

The issue occurs when the timeout canceller aborts the checkpoint before
TestingMasterHook is triggered. In that case triggerCheckpointLatch is
never released, so the thread stays blocked on await().

Proposed Fix:

I’ve found that adding a whenComplete to the firstCheckpoint future to
manually release the latch resolves the blocked state.
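
To make the idea concrete, here is a minimal standalone sketch of the race
and the proposed release. It is not the actual Flink test code: the class
LatchReleaseSketch, the method awaitAfterAbort, and the simulated abort are
hypothetical stand-ins, and I am assuming the latch is a plain
CountDownLatch and firstCheckpoint is a CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchReleaseSketch {

    /**
     * Simulates a checkpoint that is aborted before the master hook ever
     * releases the latch. Returns true if the waiting thread got unblocked.
     * All names here are illustrative, not taken from the Flink code base.
     */
    public static boolean awaitAfterAbort(boolean releaseOnCompletion)
            throws InterruptedException {
        CountDownLatch triggerCheckpointLatch = new CountDownLatch(1);
        CompletableFuture<Long> firstCheckpoint = new CompletableFuture<>();

        if (releaseOnCompletion) {
            // Proposed fix: release the latch whenever the checkpoint future
            // completes, whether normally or exceptionally.
            firstCheckpoint.whenComplete(
                    (checkpointId, failure) -> triggerCheckpointLatch.countDown());
        }

        // Stand-in for the timeout canceller aborting the checkpoint before
        // the hook has had a chance to run.
        firstCheckpoint.completeExceptionally(
                new RuntimeException("checkpoint expired (simulated)"));

        // A bounded wait is used here so the sketch itself cannot hang.
        return triggerCheckpointLatch.await(200, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("released without whenComplete: " + awaitAfterAbort(false));
        System.out.println("released with whenComplete:    " + awaitAfterAbort(true));
    }
}
```

Calling countDown() more than once is harmless on a CountDownLatch, so in
this sketch the extra release would not interfere with the path where the
hook does run first.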

I'd like to discuss whether this is the preferred approach and whether I
should move forward with a PR. As the ticket is currently unassigned, I’m
happy to take it up.

Warm regards,
Souptik
