[
https://issues.apache.org/jira/browse/FLINK-28639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-28639:
-----------------------------------
Labels: pull-request-available stale-assigned (was: pull-request-available)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue is assigned but has not
received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a
comment updating the community on your progress. If this issue is waiting on
feedback, please consider this a reminder to the committer/reviewer. Flink is a
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone
else may work on it.
> Preserve distributed consistency of OperatorEvents from subtasks to
> OperatorCoordinator
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-28639
> URL: https://issues.apache.org/jira/browse/FLINK-28639
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.3
> Reporter: Yunfeng Zhou
> Assignee: Yunfeng Zhou
> Priority: Minor
> Labels: pull-request-available, stale-assigned
>
> This is the second step to solving the consistency issue of OC
> communications. In this step, we would also guarantee the consistency of
> operator events sent from subtasks to OCs. Combined with the other subtask to
> preserve the consistency of communications in the reverse direction, all
> communications between OC and subtasks would be consistent across checkpoints
> and global failovers.
> To achieve the goal of this step, we need to add closing/reopening functions
> to the subtasks' gateways and make the subtasks aware of a checkpoint before
> they receive the checkpoint barriers. The general process would be as follows.
> 1. When the OC starts checkpoint, it notifies all subtasks about this
> information.
> 2. After being notified about the ongoing checkpoint in OC, a subtask sends a
> special operator event to its OC, which is the last operator event the OC
> could receive from the subtask before the subtask completes the checkpoint.
> Then the subtask closes its gateway.
> 3. After receiving this special event from all subtasks, the OC finishes its
> checkpoint and closes its gateway. Then the checkpoint coordinator sends
> checkpoint barriers to the sources.
> 4. If the subtask or the OC generate any event to send to each other, they
> buffer the events locally.
> 5. When a subtask starts checkpointing, it also stores the buffered events in
> the checkpoint.
> 6. After the subtask completes the checkpoint, communications in both
> directions are recovered and the buffered events are sent out.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)