[GitHub] [flink] StephanEwen opened a new pull request #11274: [FLINK-16177][checkpointing] Integrate OperatorCoordinator checkpoint triggering and committing

GitBox Sun, 01 Mar 2020 15:22:20 -0800

StephanEwen opened a new pull request #11274: [FLINK-16177][checkpointing] 
Integrate OperatorCoordinator checkpoint triggering and committing
URL: https://github.com/apache/flink/pull/11274
 
 
   ## What is the purpose of the change
   
   **This is part of 
[FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface?src=contextnavpagetreemode)**
   
   This integrates the previously added `OperatorCoordinators` with 
checkpointing, in particular with checkpoint triggering and committing.
   
   This PR initially implements a simplified version of checkpoint triggering 
that does not yet align coordinator checkpoints with the triggering of source 
checkpoint barriers.
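
   As a rough illustration of this simplified ordering (all coordinators 
checkpoint first, then the source barriers are injected), here is a minimal 
sketch. The names `CoordinatorTriggerSketch`, `Coordinator`, and 
`triggerCheckpoint` are hypothetical and are not taken from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; none of these names are taken from the Flink code base.
final class CoordinatorTriggerSketch {

    /** Stand-in for an operator coordinator that snapshots its state asynchronously. */
    interface Coordinator {
        CompletableFuture<byte[]> checkpoint(long checkpointId);
    }

    /**
     * Triggers all coordinator checkpoints and, only after every one of them
     * has completed, runs the action that injects the source barriers.
     */
    static CompletableFuture<List<byte[]>> triggerCheckpoint(
            long checkpointId,
            List<Coordinator> coordinators,
            Runnable injectSourceBarriers) {

        List<CompletableFuture<byte[]>> pending = new ArrayList<>();
        for (Coordinator coordinator : coordinators) {
            pending.add(coordinator.checkpoint(checkpointId));
        }

        return CompletableFuture
                .allOf(pending.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> {
                    List<byte[]> states = new ArrayList<>();
                    for (CompletableFuture<byte[]> future : pending) {
                        states.add(future.join()); // already complete, does not block
                    }
                    // The new stage is finished: now the barriers may be injected.
                    injectSourceBarriers.run();
                    return states;
                });
    }
}
```

   If any coordinator future fails in this sketch, the returned future 
completes exceptionally and the barrier action never runs.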
   
   ## Brief change log
   
     - Add "Operator Coordinator State" to the "OperatorState"
     - Introduce a new version of the Checkpoint Metadata format
     - Add Coordinator State to `PendingCheckpoint`
     - Add a new stage to the checkpoint triggering procedure
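
   To make the `PendingCheckpoint` change more concrete, the sketch below 
shows one way a pending checkpoint could track coordinator state next to task 
acknowledgements. It assumes that a checkpoint is complete only once both tasks 
and coordinators have acknowledged; the class `PendingCheckpointSketch` and its 
methods are hypothetical and much simpler than the real class.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the real PendingCheckpoint has a different, richer API.
final class PendingCheckpointSketch {

    private final Set<String> tasksNotYetAcknowledged;
    private final Set<String> coordinatorsNotYetAcknowledged;
    private final Map<String, byte[]> coordinatorStates = new HashMap<>();

    PendingCheckpointSketch(Set<String> tasks, Set<String> coordinators) {
        // Defensive copies, so callers cannot change what is still outstanding.
        this.tasksNotYetAcknowledged = new HashSet<>(tasks);
        this.coordinatorsNotYetAcknowledged = new HashSet<>(coordinators);
    }

    void acknowledgeTask(String taskId) {
        tasksNotYetAcknowledged.remove(taskId);
    }

    /** Stores the coordinator's state; unknown or duplicate acknowledgements are ignored. */
    void acknowledgeCoordinator(String operatorId, byte[] state) {
        if (coordinatorsNotYetAcknowledged.remove(operatorId)) {
            coordinatorStates.put(operatorId, state);
        }
    }

    /** Assumption: completion requires acknowledgements from tasks and coordinators. */
    boolean isFullyAcknowledged() {
        return tasksNotYetAcknowledged.isEmpty()
                && coordinatorsNotYetAcknowledged.isEmpty();
    }

    Map<String, byte[]> coordinatorStates() {
        return coordinatorStates;
    }
}
```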
   
   ## Issues not addressed
   
   Separate follow-ups are needed for the following two issues:
   
     - **Checkpoint restoring:** The operator coordinators only need to restore 
state on a "global restore", that is, when the master fails over or when a 
"global job failure" is triggered. In all other cases, the coordinators are 
only notified of local task failures (for example, to invalidate specific split 
assignments).
   
       To support that, the calls from the scheduler to the checkpoint 
coordinator need to differentiate between "common failover" (pipelined region) 
and full job failover (the safety net used when encountering inconsistencies or 
unexpected exceptions in the execution graph or in the coordinators). See 
[FLINK-16357](https://issues.apache.org/jira/browse/FLINK-16357).
   
     - **Exactly-once alignment** between coordinator checkpoints and source 
barrier injection is not included. In this version, the source barriers are 
injected once all coordinators are done with their checkpoints. There are 
different ideas for how to deal with in-flight `OperatorEvents`, which we plan 
to validate when implementing the `SourceEnumeratorCoordinator`. Hence, this is 
not yet included, but the code is written so that these changes can be added in 
isolation (in the `OperatorCoordinatorCheckpointContext`).
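
   The distinction between regional and global failover described in the first 
item could look roughly like the following sketch. `FailoverSketch`, 
`FailoverKind`, the `Coordinator` interface, and `handleFailover` are 
hypothetical stand-ins and do not reflect the actual scheduler or coordinator 
API.

```java
import java.util.List;

// Hypothetical sketch; the names below are illustrative, not Flink's scheduler API.
final class FailoverSketch {

    enum FailoverKind { REGIONAL, GLOBAL }

    /** Stand-in for the two ways a coordinator can learn about a failure. */
    interface Coordinator {
        void resetToCheckpoint(byte[] checkpointedState);
        void subtaskFailed(int subtaskIndex);
    }

    static void handleFailover(
            FailoverKind kind,
            Coordinator coordinator,
            byte[] latestCheckpointedState,
            List<Integer> failedSubtasks) {

        if (kind == FailoverKind.GLOBAL) {
            // Master failover or global job failure: restore the coordinator state.
            coordinator.resetToCheckpoint(latestCheckpointedState);
        } else {
            // Regional failover: keep the state and only report the failed tasks,
            // for example so that split assignments can be invalidated.
            for (int subtask : failedSubtasks) {
                coordinator.subtaskFailed(subtask);
            }
        }
    }
}
```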
   
   ## Verifying this change
   
   This change is covered by existing tests and additionally extends the test 
class `PendingCheckpointTest`.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **yes**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **not applicable**


