fredia commented on PR #20404:
URL: https://github.com/apache/flink/pull/20404#issuecomment-1204991692
> Could you please briefly describe the root cause of the failure in the ticket?
- For "Checkpoint expired before completing":
The materialization interval I set (100 ms) was too short, so periodic
materialization blocked the snapshot.
The `StreamTaskActionExecutor` in the `mailboxExecutor` of
`PeriodicMaterializationManager` is the same as the `actionExecutor` that
`StreamTask` uses for snapshots.
- For "getLatestCompletedCheckpointPath : [No value
present](https://issues.apache.org/jira/browse/FLINK-28529?focusedCommentId=17566585&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17566585)":
I'm not sure of the root cause. The log shows that a checkpoint did
complete, so my guess is that `CheckpointStatsHistory` is deleted after
`miniCluster.cancelJob()`?
```
09:34:31,289 [jobmanager-io-thread-8] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 3 for job 90dd9b9d1170496b98b83499d94bab24 (3949205 bytes, checkpointDuration=339 ms, finalizationTime=0 ms).
09:34:31,291 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 4 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1657704871290 for job 90dd9b9d1170496b98b83499d94bab24.
09:34:31,323 [KeyedProcess (3/4)#0] INFO org.apache.flink.state.changelog.ChangelogKeyedStateBackend [] - snapshot of KeyedProcess (3/4)#0 for checkpoint 4, change range: 0..2
09:34:31,345 [KeyedProcess (1/4)#0] INFO org.apache.flink.state.changelog.ChangelogKeyedStateBackend [] - snapshot of KeyedProcess (1/4)#0 for checkpoint 4, change range: 0..2
09:34:31,349 [KeyedProcess (4/4)#0] INFO org.apache.flink.state.changelog.ChangelogKeyedStateBackend [] - snapshot of KeyedProcess (4/4)#0 for checkpoint 4, change range: 0..2
09:34:31,394 [KeyedProcess (2/4)#0] INFO org.apache.flink.state.changelog.ChangelogKeyedStateBackend [] - snapshot of KeyedProcess (2/4)#0 for checkpoint 4, change range: 0..2
09:34:31,540 [jobmanager-io-thread-20] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job 90dd9b9d1170496b98b83499d94bab24 (7474937 bytes, checkpointDuration=249 ms, finalizationTime=1 ms).
...
09:34:31,576 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger checkpoint for job 90dd9b9d1170496b98b83499d94bab24 since Checkpoint triggering task Source: Custom Source (1/1) of job 90dd9b9d1170496b98b83499d94bab24 is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running..
...
09:34:31,588 [flink-akka.actor.default-dispatcher-8] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (1/4) (70022b0f9f233c9f2f04db549ef7529c_0a448493b4782967b150582570326227_0_0) switched from
09:34:31,674 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopped Akka RPC service.
09:34:31,688 [ main] ERROR org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase [] -
--------------------------------------------------------------------------------
Test testSwitchFromDisablingToEnablingInClaimMode[delegated state backend type = org.apache.flink.runtime.state.hashmap.HashMapStateBackend@312aa7c](org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase) failed with:
java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase.testSwitchFromDisablingToEnablingInClaimMode(ChangelogPeriodicMaterializationSwitchStateBackendITCase.java:115)
```
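Going back to the first failure ("Checkpoint expired before completing"), here is a rough sketch of the blocking effect. This is not Flink code: a plain single-threaded `ExecutorService` stands in for the shared executor, and the class name, method name, and 20 ms sleep are all made up for illustration. It only shows that when materialization work and the snapshot go through the same executor, the snapshot has to wait for whatever materialization work is already queued.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; an analogy only, none of this is a Flink API.
public class SharedExecutorDemo {

    // Returns the order in which actions ran on the single shared executor.
    static List<String> runOrder(int pendingMaterializations) throws Exception {
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        // Single thread: every action submitted here runs strictly one after another.
        ExecutorService shared = Executors.newSingleThreadExecutor();
        try {
            // A short materialization interval means many of these pile up.
            for (int i = 0; i < pendingMaterializations; i++) {
                shared.submit(() -> {
                    sleepQuietly(20); // assumed cost of one materialization
                    order.add("materialize");
                });
            }
            // The snapshot uses the same executor, so it queues behind them.
            Future<?> snapshot = shared.submit(() -> {
                order.add("snapshot");
            });
            snapshot.get(5, TimeUnit.SECONDS);
            return new ArrayList<>(order);
        } finally {
            shared.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the executor can shut down cleanly.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints [materialize, materialize, materialize, snapshot]
        System.out.println(runOrder(3));
    }
}
```

If the queued materialization work takes longer than the checkpoint timeout, the snapshot in this toy model would start too late, which is the analogue of the checkpoint expiring.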
> In either case, could you please explain why the refactoring is necessary (e.g. in the commit message)?
The main goal of this PR is to make the test more stable, not just to
refactor it. I will update the commit message.
> I'm also wondering whether PeriodicMaterialization part in test name is still relevant; if not, probably this PR is a good place to rename it. WDYT?
Agreed. How about renaming it to `ChangelogSwitchStateBackendITCase`?