fredia commented on PR #20404:
URL: https://github.com/apache/flink/pull/20404#issuecomment-1204991692

   > Could you please briefly describe the root cause of the failure in the 
ticket?
   
   - For "Checkpoint expired before completing": 
     The materialization interval I set (100ms) was too short, so periodic 
materialization blocked the snapshot.
   The `StreamTaskActionExecutor` in the `mailboxExecutor` of 
`PeriodicMaterializationManager` is the same as the `actionExecutor` used for 
snapshots in the `StreamTask`, so the two contend for it.
   - For "getLatestCompletedCheckpointPath : [No value 
present](https://issues.apache.org/jira/browse/FLINK-28529?focusedCommentId=17566585&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17566585)":
   I'm not sure of the root cause. The log shows that a checkpoint did 
complete, so my guess is that `CheckpointStatsHistory` is deleted after 
`miniCluster.cancelJob()`?
   
   ```
   09:34:31,289 [jobmanager-io-thread-8] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 3 for job 90dd9b9d1170496b98b83499d94bab24 (3949205 bytes, 
checkpointDuration=339 ms, finalizationTime=0 ms).
   09:34:31,291 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 4 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1657704871290 for job 
90dd9b9d1170496b98b83499d94bab24.
   09:34:31,323 [KeyedProcess (3/4)#0] INFO  
org.apache.flink.state.changelog.ChangelogKeyedStateBackend  [] - snapshot of 
KeyedProcess (3/4)#0 for checkpoint 4, change range: 0..2
   09:34:31,345 [KeyedProcess (1/4)#0] INFO  
org.apache.flink.state.changelog.ChangelogKeyedStateBackend  [] - snapshot of 
KeyedProcess (1/4)#0 for checkpoint 4, change range: 0..2
   09:34:31,349 [KeyedProcess (4/4)#0] INFO  
org.apache.flink.state.changelog.ChangelogKeyedStateBackend  [] - snapshot of 
KeyedProcess (4/4)#0 for checkpoint 4, change range: 0..2
   09:34:31,394 [KeyedProcess (2/4)#0] INFO  
org.apache.flink.state.changelog.ChangelogKeyedStateBackend  [] - snapshot of 
KeyedProcess (2/4)#0 for checkpoint 4, change range: 0..2
   09:34:31,540 [jobmanager-io-thread-20] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 4 for job 90dd9b9d1170496b98b83499d94bab24 (7474937 bytes, 
checkpointDuration=249 ms, finalizationTime=1 ms).
   ...
   09:34:31,576 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger checkpoint for job 90dd9b9d1170496b98b83499d94bab24 since Checkpoint 
triggering task Source: Custom Source (1/1) of job 
90dd9b9d1170496b98b83499d94bab24 is not being executed at the moment. Aborting 
checkpoint. Failure reason: Not all required tasks are currently running..
   ...
   09:34:31,588 [flink-akka.actor.default-dispatcher-8] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - KeyedProcess 
(1/4) (70022b0f9f233c9f2f04db549ef7529c_0a448493b4782967b150582570326227_0_0) 
switched from 
   09:34:31,674 [flink-akka.actor.default-dispatcher-4] INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Stopped Akka 
RPC service.
   09:34:31,688 [                main] ERROR 
org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase
 [] - 
   
--------------------------------------------------------------------------------
   Test testSwitchFromDisablingToEnablingInClaimMode[delegated state backend 
type = 
org.apache.flink.runtime.state.hashmap.HashMapStateBackend@312aa7c](org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase)
 failed with:
   java.util.NoSuchElementException: No value present
        at java.util.Optional.get(Optional.java:135)
        at 
org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase.testSwitchFromDisablingToEnablingInClaimMode(ChangelogPeriodicMaterializationSwitchStateBackendITCase.java:115)
   
   ```
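
To make the "Checkpoint expired before completing" explanation above more concrete, here is a minimal sketch of the contention. It is a simplified model and not Flink code: a plain single-threaded `ExecutorService` stands in for the shared action executor, and the class name `SharedExecutorSketch` and the method `tasksRunBeforeSnapshot` are hypothetical names made up for illustration. It only shows that a snapshot queued behind many materialization actions on the same executor has to wait for all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model, NOT Flink's mailbox: one single-threaded executor is
// shared by periodic "materialization" actions and by the "snapshot" action.
public class SharedExecutorSketch {

    /**
     * Queues the given number of materialization actions, then one snapshot
     * action, on the same single-threaded executor. Returns how many actions
     * had already run when the snapshot finally got its turn.
     */
    public static int tasksRunBeforeSnapshot(int pendingMaterializations) throws Exception {
        ExecutorService sharedExecutor = Executors.newSingleThreadExecutor();
        // Only the single worker thread touches this list.
        List<String> executionOrder = new ArrayList<>();
        try {
            for (int i = 0; i < pendingMaterializations; i++) {
                sharedExecutor.execute(() -> executionOrder.add("materialization"));
            }
            // The snapshot is queued last, so it runs only after everything ahead of it.
            Future<Integer> snapshot = sharedExecutor.submit(() -> {
                int ranBefore = executionOrder.size();
                executionOrder.add("snapshot");
                return ranBefore;
            });
            return snapshot.get();
        } finally {
            sharedExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tasksRunBeforeSnapshot(0));  // nothing queued ahead of the snapshot
        System.out.println(tasksRunBeforeSnapshot(50)); // 50 actions run before the snapshot
    }
}
```

In this model, the shorter the materialization interval, the more actions pile up ahead of the snapshot, which fits the timeout seen with the 100ms interval.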
   
   > In either case, could you please explain why the refactoring is necessary 
(e.g. in the commit message)?
   
   This PR is mainly meant to make the test more stable, not just to refactor 
it. I will update the commit message.
   
   > I'm also wondering whether PeriodicMaterialization part in test name is 
still relevant; if not, probably this PR is a good place to rename it. WDYT?
   
   Agreed. How about renaming it to `ChangelogSwitchStateBackendITCase`?


