rkhachatryan opened a new pull request #14635:
URL: https://github.com/apache/flink/pull/14635


   ## What is the purpose of the change
   
   Update checkpoint statistics (shown in the web UI) even after a checkpoint fails.
   This makes it easier to investigate issues with slow checkpointing.
   
   With this change, the stats of a failed checkpoint are updated when:
   1. A subtask acks the checkpoint too late, or after some other failure. `AsyncCheckpointRunnable` completes normally and reports the snapshot as usual. `CheckpointCoordinator` was updated to handle these calls.
   1. A subtask receives an abort notification and cancels the runnable before it completes. In this case it only reports the metrics. Both the TM and JM sides were updated and a **new RPC was added**.
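
The two cases above can be pictured with a small, self-contained model. This is only a rough sketch and not the code in this PR: the class `FailedCheckpointStatsSketch`, its methods, and the tracked fields are hypothetical names chosen for illustration. It assumes the coordinator keeps the stats entry of a failed checkpoint around so that late reports can still be recorded.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model. Not Flink's actual CheckpointCoordinator. */
public class FailedCheckpointStatsSketch {

    enum Status { IN_PROGRESS, FAILED }

    /** Per-checkpoint statistics, as a web UI might display them. */
    static final class CheckpointStats {
        Status status = Status.IN_PROGRESS;
        int reportedSubtasks = 0;
        long maxAsyncDurationMillis = 0L;
    }

    // Assumption: stats entries are kept after a failure instead of being dropped.
    private final Map<Long, CheckpointStats> statsById = new HashMap<>();

    void startCheckpoint(long checkpointId) {
        statsById.put(checkpointId, new CheckpointStats());
    }

    void failCheckpoint(long checkpointId) {
        CheckpointStats stats = statsById.get(checkpointId);
        if (stats != null) {
            stats.status = Status.FAILED;
        }
    }

    /**
     * Case 1: a subtask acks with a full snapshot, possibly too late.
     * The metrics are always recorded; the return value says whether the
     * ack still counts towards completing the checkpoint.
     */
    boolean acknowledge(long checkpointId, long asyncDurationMillis) {
        CheckpointStats stats = statsById.get(checkpointId);
        if (stats == null) {
            return false; // unknown checkpoint: nothing to update
        }
        record(stats, asyncDurationMillis);
        return stats.status == Status.IN_PROGRESS;
    }

    /** Case 2: the runnable was cancelled, so the subtask reports metrics only. */
    void reportMetrics(long checkpointId, long asyncDurationMillis) {
        CheckpointStats stats = statsById.get(checkpointId);
        if (stats != null) {
            record(stats, asyncDurationMillis);
        }
    }

    CheckpointStats statsFor(long checkpointId) {
        return statsById.get(checkpointId);
    }

    private static void record(CheckpointStats stats, long asyncDurationMillis) {
        stats.reportedSubtasks++;
        stats.maxAsyncDurationMillis =
                Math.max(stats.maxAsyncDurationMillis, asyncDurationMillis);
    }

    public static void main(String[] args) {
        FailedCheckpointStatsSketch tracker = new FailedCheckpointStatsSketch();
        tracker.startCheckpoint(1L);
        tracker.failCheckpoint(1L);          // e.g. the checkpoint timed out
        tracker.acknowledge(1L, 1500L);      // case 1: late ack
        tracker.reportMetrics(1L, 700L);     // case 2: metrics-only report
        CheckpointStats stats = tracker.statsFor(1L);
        System.out.println(stats.status + " " + stats.reportedSubtasks
                + " " + stats.maxAsyncDurationMillis);
    }
}
```

In this model a late ack and a metrics-only report both end up in the same stats entry; the only difference is that the ack carries a snapshot whose result is ignored once the checkpoint has failed.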
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   - `CheckpointCoordinatorTest.testCheckpointStatsUpdatedAfterFailure`
   - `CheckpointCoordinatorTest.testAbortedCheckpointStatsUpdatedAfterFailure`
   - Manually verified the change by running `DataStreamAllroundTestProgram` on a local cluster with the following configuration:
   ```
   execution.checkpointing.interval: 10s
   execution.checkpointing.min-pause: 1s
   execution.checkpointing.timeout: 1s
   execution.checkpointing.tolerable-failed-checkpoints: 1000000
   execution.checkpointing.unaligned: true
   taskmanager.numberOfTaskSlots: 8
   web.checkpoints.history: 100
   ```
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes 
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   

