[
https://issues.apache.org/jira/browse/FLINK-39140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rui Fan updated FLINK-39140:
----------------------------
Description:
Current Unaligned Checkpoint ITCases only restart once, from a normal
checkpoint. They do not cover restoring from a checkpoint produced during the
recovery phase, which is the key scenario for checkpointing during recovery.
*Proposed mechanism:* After restoring from a checkpoint, wait for the first new
checkpoint to be produced, then immediately trigger a restart from it. Repeat
for a configurable number of rounds (≥ 2). Whether to rescale depends on the
specific test case.
This mechanism already works on the current master, where it validates normal
checkpoint recovery. Once checkpointing during recovery is enabled, the same
tests automatically cover recovery-phase checkpoint scenarios.
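
The proposed restart loop could be sketched roughly as follows. This is only an
illustration under stated assumptions: JobHarness, FakeHarness, runRounds and
the per-round parallelism array are hypothetical names invented for this
sketch, not Flink APIs or the actual ITCase code. The in-memory fake stands in
for a real test cluster so the control flow can be run on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class RestartFromRecoveryCheckpointSketch {

    /** Hypothetical stand-in for a test cluster; NOT a Flink API. */
    public interface JobHarness {
        /** (Re)starts the job from the given checkpoint with the given parallelism. */
        void restoreFrom(long checkpointId, int parallelism);

        /** Waits for the first checkpoint completed after the last restore; returns its id. */
        long awaitFirstNewCheckpoint();
    }

    /** In-memory fake (assumption): each new checkpoint id is the restored id plus one. */
    public static final class FakeHarness implements JobHarness {
        public final List<Long> restoredFrom = new ArrayList<>();
        private long latest;

        @Override
        public void restoreFrom(long checkpointId, int parallelism) {
            restoredFrom.add(checkpointId);
            latest = checkpointId;
        }

        @Override
        public long awaitFirstNewCheckpoint() {
            return ++latest;
        }
    }

    /**
     * Restores, waits for the first new checkpoint, then restarts from it,
     * for the configured number of rounds. A different parallelism per round
     * models an optional rescale; repeating the same value means no rescale.
     */
    public static long runRounds(
            JobHarness harness, long initialCheckpoint, int rounds, int[] parallelismPerRound) {
        if (rounds < 2) {
            throw new IllegalArgumentException("rounds must be >= 2");
        }
        if (parallelismPerRound.length != rounds) {
            throw new IllegalArgumentException("one parallelism value per round is required");
        }
        long checkpoint = initialCheckpoint;
        for (int round = 0; round < rounds; round++) {
            harness.restoreFrom(checkpoint, parallelismPerRound[round]);
            // The next round restarts from the first checkpoint taken after this restore.
            checkpoint = harness.awaitFirstNewCheckpoint();
        }
        return checkpoint;
    }

    public static void main(String[] args) {
        FakeHarness harness = new FakeHarness();
        long last = runRounds(harness, 1L, 3, new int[] {4, 2, 4});
        System.out.println("restored from " + harness.restoredFrom + ", last checkpoint " + last);
    }
}
```

In a real ITCase the harness would presumably wrap the existing test cluster
utilities, and the restart would be triggered as soon as the new checkpoint
completes, so that once checkpointing during recovery is enabled the checkpoint
may be taken while the job is still recovering.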
was:
Current Unaligned Checkpoint ITCases only restart once from a normal
checkpoint. They do not cover restoring from a checkpoint produced by recovery
phase — which is the key scenario for checkpointing during recovery.
*Proposed mechanism:* After restoring from a checkpoint, wait for the first new
checkpoint to be produced, then immediately trigger a restart from it. Repeat
for a configurable number of rounds (≥ 2). Whether to rescale depends on the
specific test case.
This mechanism works on the current master (validating normal checkpoint
recovery). Once checkpointing during recovery is enabled, the same tests
automatically cover recovery-phase checkpoint scenarios.
h2. Affected ITCases
* UnalignedCheckpointRescaleITCase
* UnalignedCheckpointRescaleWithMixedExchangesITCase
* UnalignedCheckpointITCase
* UnalignedCheckpointCompatibilityITCase
* UnalignedCheckpointStressITCase
* UnalignedCheckpointFailureHandlingITCase
> Enhance Unaligned Checkpoint ITCases to perform checkpointing during recovery
> -----------------------------------------------------------------------------
>
> Key: FLINK-39140
> URL: https://issues.apache.org/jira/browse/FLINK-39140
> Project: Flink
> Issue Type: Sub-task
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major