[GitHub] [flink] yunfengzhou-hub opened a new pull request, #20275: [FLINK-26029][Runtime/Checkpointing] Generalize the checkpoint protocol of OperatorCoordinator

GitBox Thu, 14 Jul 2022 05:03:27 -0700


yunfengzhou-hub opened a new pull request, #20275:
URL: https://github.com/apache/flink/pull/20275


   ## What is the purpose of the change
   
   Currently, operator events sent from a coordinator to its subtasks are 
temporarily blocked when the coordinator starts checkpointing and are unblocked 
once the checkpoint barriers have been injected into the sources. This PR 
extends the blocking period so that events for a subtask stay blocked until 
that subtask completes the current checkpoint. This keeps operator events 
consistent across checkpoints and failovers.
   
   This PR is part of FLINK-26029, which aims to generalize the checkpoint 
protocol of operator coordinators and make coordinators applicable to 
non-source operators as well.
   
   ## Brief change log
   
   - The subtask gateway can now be closed and reopened, and it replaces the 
operator event valve.
   - A new operator event type is introduced to notify the coordinator that a 
subtask has completed a checkpoint.
   - Subtasks send this event to their coordinators after snapshotting their 
state.
   - The holder of an operator coordinator reopens a subtask gateway when it 
receives this event, instead of when checkpoint barriers are injected into 
sources.
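
   The close/reopen behavior listed above can be illustrated with a minimal, 
self-contained sketch. It is not the code in this PR: the class name 
`ClosableGatewaySketch`, its methods, and the use of plain strings as events 
are all hypothetical, chosen only to show how events sent while a gateway is 
closed might be buffered and then flushed once the subtask reports that it 
has completed the checkpoint.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical sketch of a closable subtask gateway; not Flink's real class. */
final class ClosableGatewaySketch {
    private static final long NOT_CLOSED = -1L;

    private final Consumer<String> subtask;              // stands in for the real subtask
    private final Queue<String> blockedEvents = new ArrayDeque<>();
    private long closedForCheckpoint = NOT_CLOSED;

    ClosableGatewaySketch(Consumer<String> subtask) {
        this.subtask = subtask;
    }

    /** Closes the gateway when the coordinator starts the given checkpoint. */
    void close(long checkpointId) {
        closedForCheckpoint = checkpointId;
    }

    /** Delivers the event right away if open, otherwise buffers it in order. */
    void sendEvent(String event) {
        if (closedForCheckpoint == NOT_CLOSED) {
            subtask.accept(event);
        } else {
            blockedEvents.add(event);
        }
    }

    /**
     * Reopens the gateway once the subtask reports completing the checkpoint
     * the gateway was closed for. Reports for other checkpoints are ignored.
     */
    void acknowledgeCheckpoint(long checkpointId) {
        if (closedForCheckpoint == NOT_CLOSED || checkpointId != closedForCheckpoint) {
            return;
        }
        closedForCheckpoint = NOT_CLOSED;
        String event;
        while ((event = blockedEvents.poll()) != null) {
            subtask.accept(event);                        // flush in original order
        }
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        ClosableGatewaySketch gateway = new ClosableGatewaySketch(delivered::add);

        gateway.sendEvent("a");            // open: delivered immediately
        gateway.close(1L);                 // coordinator starts checkpoint 1
        gateway.sendEvent("b");            // buffered
        gateway.sendEvent("c");            // buffered
        gateway.acknowledgeCheckpoint(1L); // subtask finished checkpoint 1

        System.out.println(delivered);     // prints [a, b, c]
    }
}
```

   In this sketch the buffered events are neither dropped nor reordered; they 
only arrive later, which is the latency trade-off described further down in 
this PR description.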
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
   - Added unit tests to verify the correctness of the closing/reopening 
behavior of subtask gateways.
   - Added integration tests to verify that operator events sent from 
coordinators to subtasks are not lost after failover, even if the events are 
sent during checkpoints.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (yes)
       - This PR additionally blocks communication between a coordinator and 
its subtasks from the time checkpoint barriers are sent to sources until the 
subtasks complete the checkpoint. An operator event generated during a 
checkpoint is therefore delivered only after the target subtask completes the 
checkpoint, which increases the latency of that event.
   
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
       - This PR affects the latency of operator events from coordinators to 
subtasks, as described above.
   
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (no)
     - If yes, how is the feature documented? (not applicable)


