StephanEwen opened a new pull request #11274: [FLINK-16177][checkpointing] Integrate OperatorCoordinator checkpoint triggering and committing
URL: https://github.com/apache/flink/pull/11274

## What is the purpose of the change

**This is part of [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface?src=contextnavpagetreemode)**

This integrates the previously added `OperatorCoordinators` with checkpointing, in particular with checkpoint triggering and committing.

This PR initially implements a simplified version of checkpoint triggering that does not offer the facilities to align coordinator checkpoints with source checkpoint barrier triggering.

## Brief change log

- Add "Operator Coordinator State" to the "OperatorState"
- Introduce a new version of the Checkpoint Metadata format
- Add Coordinator State to `PendingCheckpoint`
- Add a new stage to the checkpoint triggering procedure

## Not Addressed issues

Separate follow-ups are needed for the following two issues:

- **Checkpoint restoring** The operator coordinators only need to restore on "global restore", meaning when the master fails over, or when a "global job failure" is triggered. Otherwise, the coordinators are notified of local task failures (for example to invalidate specific split assignments). To support that, the calls from the scheduler to the checkpoint coordinator need to differentiate between "common failover" (pipelined region) and "full job failover" (the safety net used when encountering inconsistencies or unexpected exceptions in the execution graph or in the coordinators). See [FLINK-16357](https://issues.apache.org/jira/browse/FLINK-16357)

- **Exactly Once** alignment between coordinator checkpoints and source barrier injection is not included. In this version, the source barriers are injected once all coordinators are done with their checkpoints. There are different ideas for how to deal with in-flight `OperatorEvents`, which we plan to validate when implementing the `SourceEnumeratorCoordinator`. Hence, this is not yet included, but the code is written so that these changes can be added in isolation (in the `OperatorCoordinatorCheckpointContext`).

## Verifying this change

This change is already covered by existing tests, and partially extends the class `PendingCheckpointTest`.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **no**
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **yes**
- The S3 file system connector: **no**

## Documentation

- Does this pull request introduce a new feature? **no**
- If yes, how is the feature documented? **not applicable**
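
## Illustration of the simplified triggering order

The ordering described under "Exactly Once" (source barriers are injected only once all coordinators are done with their checkpoints) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `Coordinator` interface, the `triggerCheckpoint` method, and the `injectSourceBarriers` callback are hypothetical names chosen for the example, and the coordinator state is assumed to be a plain `byte[]`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class CoordinatorCheckpointSketch {

    /** Hypothetical stand-in for an operator coordinator; not Flink's actual interface. */
    interface Coordinator {
        CompletableFuture<byte[]> checkpoint(long checkpointId);
    }

    /**
     * Triggers the checkpoint on every coordinator and, only after all of them
     * have completed, runs the barrier-injection action. The returned future
     * carries the collected coordinator states (assumed here to be byte arrays).
     */
    static CompletableFuture<List<byte[]>> triggerCheckpoint(
            long checkpointId,
            List<Coordinator> coordinators,
            Runnable injectSourceBarriers) {

        // Stage 1: ask every coordinator for its checkpoint state.
        List<CompletableFuture<byte[]>> futures = new ArrayList<>();
        for (Coordinator coordinator : coordinators) {
            futures.add(coordinator.checkpoint(checkpointId));
        }

        // Stage 2: once all coordinator futures are done, collect the states
        // and inject the source barriers. No alignment with in-flight events
        // is attempted in this simplified ordering.
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> {
                    List<byte[]> states = new ArrayList<>();
                    for (CompletableFuture<byte[]> future : futures) {
                        states.add(future.join()); // already complete, does not block
                    }
                    injectSourceBarriers.run();
                    return states;
                });
    }
}
```

In this sketch, a failed coordinator checkpoint completes the combined future exceptionally, so the barrier-injection action is never run for that checkpoint. Whether the actual implementation behaves the same way is not stated in this description.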
