ifndef-SleePy opened a new pull request #9269: [FLINK-9900][tests] Fix unstable 
ZooKeeperHighAvailabilityITCase
URL: https://github.com/apache/flink/pull/9269
 
 
   ## What is the purpose of the change
   
   * Fix unstable 
`ZooKeeperHighAvailabilityITCase`.`testRestoreBehaviourWithFaultyStateHandles`
   
   * The test is designed as follows
     - It assumes that the first 5 checkpoints (1-5) succeed
     - Then the job blocks on the snapshot of checkpoint 6
     - At this point, the checkpoint files are moved away on purpose
     - Checkpoint 6 fails due to an expected snapshot failure
     - The job then fails because of this failed checkpoint
     - The job cannot recover from checkpoint 5 because its checkpoint files are missing
     - After the checkpoint files are moved back, the job can recover and continue working
   
   * However, there is a race condition between failing the job and triggering another checkpoint
   * If the job is not canceled fast enough, an unexpected checkpoint 7 might succeed
   * The job could then recover from checkpoint 7 without waiting for the checkpoint files to be moved back
   
   ## Brief change log
   
   * The basic idea of the fix is to prevent the unexpected checkpoint 7
   * Add a latch that blocks snapshots until the HA storage is recovered
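
The latch idea can be illustrated with a minimal, standalone sketch. This is not the code from the pull request and does not use any Flink API: the class `SnapshotLatchSketch`, its methods, and the 200 ms wait are hypothetical names and values chosen only to show how a `CountDownLatch` keeps any checkpoint after the failed one from completing until the test signals that the HA storage is back.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the test's checkpointed function; not Flink code.
class SnapshotLatchSketch {
    // Counted down by the test once the checkpoint files have been moved back.
    private final CountDownLatch haStorageRecovered = new CountDownLatch(1);
    private final AtomicInteger completedCheckpoints = new AtomicInteger();

    // Mimics a snapshot call: 1-5 succeed, 6 fails, later ones wait for the latch.
    void snapshot(long checkpointId) throws InterruptedException {
        if (checkpointId == 6) {
            throw new IllegalStateException("expected snapshot failure");
        }
        if (checkpointId > 6) {
            haStorageRecovered.await();
        }
        completedCheckpoints.incrementAndGet();
    }

    void markHaStorageRecovered() {
        haStorageRecovered.countDown();
    }

    int completed() {
        return completedCheckpoints.get();
    }

    // Returns {checkpoints completed before recovery, completed after recovery}.
    static int[] runScenario() {
        SnapshotLatchSketch job = new SnapshotLatchSketch();
        try {
            for (long id = 1; id <= 5; id++) {
                job.snapshot(id);
            }
            try {
                job.snapshot(6);
            } catch (IllegalStateException expected) {
                // Checkpoint 6 is supposed to fail.
            }
            Thread checkpoint7 = new Thread(() -> {
                try {
                    job.snapshot(7);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            checkpoint7.start();
            checkpoint7.join(200); // checkpoint 7 cannot finish while the latch is closed
            int before = job.completed();
            job.markHaStorageRecovered();
            checkpoint7.join();
            return new int[] {before, job.completed()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```

In this sketch, `runScenario()` observes 5 completed checkpoints while the latch is closed and 6 once it is released, which mirrors the intended behaviour: no checkpoint 7 can succeed before the storage is recovered.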
   
   ## Verifying this change
   
   * This change is already covered by existing tests
   * The unstable scenario can be reproduced as follows
     - There is a race condition between failing the job and triggering another checkpoint
     - Making the job fail more slowly reproduces the scenario
     - Modify the `FailJobCallback` of `CheckpointFailureManager` in `ExecutionGraph`.`enableCheckpointing`: change the `execute` to `schedule` with a delay
     - There will then be an unexpected successful checkpoint 7
     - The test hangs forever: it never fails 5 times, because it can recover from checkpoint 7
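
The effect of switching from `execute` to `schedule` with a delay can be simulated outside of Flink. The sketch below is an assumption-laden illustration only: `DelayedFailureSketch`, `checkpoint7Succeeds`, and the single-threaded executor standing in for the job's main thread are all hypothetical, and the real change would be made in the `FailJobCallback` mentioned above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical simulation of the race; not Flink code.
class DelayedFailureSketch {
    // Returns true if checkpoint 7 completed before the job was failed.
    static boolean checkpoint7Succeeds(long failDelayMillis) {
        // A single thread stands in for the thread that runs both actions.
        ScheduledExecutorService mainThread = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean jobRunning = new AtomicBoolean(true);
        AtomicBoolean checkpoint7Completed = new AtomicBoolean(false);
        try {
            Runnable failJob = () -> jobRunning.set(false);
            if (failDelayMillis == 0) {
                mainThread.execute(failJob); // fail the job immediately
            } else {
                // Delayed failure, as in the reproduction steps.
                mainThread.schedule(failJob, failDelayMillis, TimeUnit.MILLISECONDS);
            }
            // The next checkpoint is triggered right after the failure is reported;
            // it only completes if the job is still running at that point.
            Future<?> checkpoint7 =
                    mainThread.submit(() -> checkpoint7Completed.set(jobRunning.get()));
            checkpoint7.get();
            return checkpoint7Completed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            mainThread.shutdownNow();
        }
    }
}
```

With no delay, the failure runs first and checkpoint 7 is rejected; with a delay such as 200 ms, checkpoint 7 slips in before the job fails, which is the unexpected successful checkpoint described above.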
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
