[ 
https://issues.apache.org/jira/browse/FLINK-39140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Fan updated FLINK-39140:
----------------------------
    Description: 
The current Unaligned Checkpoint ITCases restart only once, from a normal 
checkpoint. They do not cover restoring from a checkpoint produced during the 
recovery phase, which is the key scenario for checkpointing during recovery.

*Proposed mechanism:* After restoring from a checkpoint, wait for the first new 
checkpoint to be produced, then immediately trigger a restart from it. Repeat 
for a configurable number of rounds (≥ 2). Whether to rescale depends on the 
specific test case.
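
A minimal sketch of the loop described above, not the actual ITCase change. The class name, `restoreRepeatedly`, and the `runUntilFirstCheckpoint` hook are hypothetical names introduced only for illustration; the hook stands in for "start the job (optionally from a checkpoint), wait for the first new checkpoint, then stop". Rescaling is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class RestoreLoopSketch {

    /**
     * Runs the job once from a fresh start, then restarts it `rounds` times,
     * each time restoring from the first checkpoint produced by the previous run.
     *
     * `runUntilFirstCheckpoint` is a hypothetical hook: it receives the checkpoint
     * to restore from (null for a fresh start) and returns the first checkpoint
     * completed by that run.
     */
    static List<String> restoreRepeatedly(
            int rounds, Function<String, String> runUntilFirstCheckpoint) {
        if (rounds < 2) {
            throw new IllegalArgumentException("rounds must be >= 2");
        }
        List<String> chain = new ArrayList<>();
        // Initial run: no checkpoint to restore from.
        String checkpoint = runUntilFirstCheckpoint.apply(null);
        chain.add(checkpoint);
        for (int round = 1; round <= rounds; round++) {
            // Restart immediately from the first checkpoint of the previous run.
            checkpoint = runUntilFirstCheckpoint.apply(checkpoint);
            chain.add(checkpoint);
        }
        return chain;
    }

    public static void main(String[] args) {
        // Stand-in for a real job: each run just hands out the next checkpoint name.
        int[] counter = {0};
        Function<String, String> fakeJob = restoreFrom -> "chk-" + (++counter[0]);
        System.out.println(restoreRepeatedly(3, fakeJob));
    }
}
```

With checkpointing during recovery enabled, every checkpoint after the first in the returned chain could have been taken while the job was still recovering, which is the case the extra rounds are meant to exercise.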

This mechanism works on the current master (validating normal checkpoint 
recovery). Once checkpointing during recovery is enabled, the same tests 
automatically cover recovery-phase checkpoint scenarios.

  was:
The current Unaligned Checkpoint ITCases restart only once, from a normal 
checkpoint. They do not cover restoring from a checkpoint produced during the 
recovery phase, which is the key scenario for checkpointing during recovery.

*Proposed mechanism:* After restoring from a checkpoint, wait for the first new 
checkpoint to be produced, then immediately trigger a restart from it. Repeat 
for a configurable number of rounds (≥ 2). Whether to rescale depends on the 
specific test case.

This mechanism works on the current master (validating normal checkpoint 
recovery). Once checkpointing during recovery is enabled, the same tests 
automatically cover recovery-phase checkpoint scenarios.
h2. Affected ITCases
 * UnalignedCheckpointRescaleITCase
 * UnalignedCheckpointRescaleWithMixedExchangesITCase
 * UnalignedCheckpointITCase
 * UnalignedCheckpointCompatibilityITCase
 * UnalignedCheckpointStressITCase
 * UnalignedCheckpointFailureHandlingITCase


> Enhance Unaligned Checkpoint ITCases to perform checkpointing during recovery
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-39140
>                 URL: https://issues.apache.org/jira/browse/FLINK-39140
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>
> The current Unaligned Checkpoint ITCases restart only once, from a normal 
> checkpoint. They do not cover restoring from a checkpoint produced during the 
> recovery phase, which is the key scenario for checkpointing during recovery.
> *Proposed mechanism:* After restoring from a checkpoint, wait for the first 
> new checkpoint to be produced, then immediately trigger a restart from it. 
> Repeat for a configurable number of rounds (≥ 2). Whether to rescale depends 
> on the specific test case.
> This mechanism works on the current master (validating normal checkpoint 
> recovery). Once checkpointing during recovery is enabled, the same tests 
> automatically cover recovery-phase checkpoint scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
