Jiangjie Qin created FLINK-26029:
------------------------------------
Summary: Generalize the checkpoint protocol of OperatorCoordinator.
Key: FLINK-26029
URL: https://issues.apache.org/jira/browse/FLINK-26029
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 1.14.3
Reporter: Jiangjie Qin
Currently the JM opens all the event valves from the OperatorCoordinator to the
subtasks after the checkpoint barriers are sent to the Source subtasks. While
this works for the Source Operators, it unnecessarily limits general usage of
the OperatorCoordinator for other operators.
To generalize the protocol, we can change the JM to open the event valve of the
subtasks that have finished the local checkpoint. So the protocol would become
following:
# Let the OC finish processing all the incoming OperatorEvents before the
snapshot.
# Wait until all the outgoing OperatorEvents before the snapshot are sent and
acked.
# Shut the event valve so no outgoing events can be sent to the subtasks.
# Send checkpoint barriers to the Source operators.
# Open the corresponding event valve of a subtask when the
AcknowledgeCheckpoint messages from that subtask is received.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)