yunfengzhou-hub opened a new pull request, #20275:
URL: https://github.com/apache/flink/pull/20275
## What is the purpose of the change
Now, the operator events sent from a coordinator to its subtasks would be
temporarily blocked when the coordinator starts checkpointing and be unblocked
after the checkpoint barriers are injected into the sources. This PR extends
the duration of this blocking period so that an event for a subtask is blocked
until the subtask completes the current checkpoint. By doing this, this PR
preserves the consistency of these operator events during checkpoints and
failovers.
This PR is a component of FLINK-26029, which aims to generalize the
checkpoint protocol of operator coordinators and makes coordinators applicable
for non-source operators as well.
## Brief change log
- The subtask gateway can be closed/reopened and replaces the operator event
valve.
- An operator event type is introduced to notify the coordinator about a
subtask completing a checkpoint.
- Subtasks would send the operator event above to their coordinators after
snapshotting their states.
- The holder of an operator coordinator reopens a subtask gateway after it
receives the operator event above, instead of after checkpoint barriers are
injected into sources.
## Verifying this change
This change added tests and can be verified as follows:
- Added unit tests to verify the correctness of the closing/reopening
behavior of subtask gateways.
- Added integration tests to verify that the operator events sent from
coordinators to subtask would not be lost after failover, even if the events
are sent during checkpoints.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (yes)
- This PR additionally blocks the communication between a coordinator
and its subtasks from when checkpoint barriers are sent to sources to when
subtasks complete the checkpoint. This means that an operator event generated
during checkpoints would be delivered only after the target subtask completes
the checkpoint, bringing longer latency to that event.
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
- This PR affects the latency of operator events from coordinators to
subtasks, as described above.
- The S3 file system connector: (no)
## Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]