[
https://issues.apache.org/jira/browse/FLINK-25055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-25055:
-----------------------------------
Labels: pull-request-available stale-assigned (was: pull-request-available)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue is assigned but has not
received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a
comment updating the community on your progress. If this issue is waiting on
feedback, please consider this a reminder to the committer/reviewer. Flink is a
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone
else may work on it.
> Support listen and notify mechanism for PartitionRequest
> --------------------------------------------------------
>
> Key: FLINK-25055
> URL: https://issues.apache.org/jira/browse/FLINK-25055
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Network
> Affects Versions: 1.14.0, 1.12.5, 1.13.3
> Reporter: Shammon
> Assignee: Shammon
> Priority: Major
> Labels: pull-request-available, stale-assigned
>
> We submit batch jobs to flink session cluster with eager scheduler for olap.
> JM deploys subtasks to TaskManager independently, and the downstream subtasks
> may start before the upstream ones are ready. The downstream subtask sends
> PartitionRequest to upstream ones, and may receive PartitionNotFoundException
> from them. Then it will retry to send PartitionRequest after a few ms until
> timeout.
> The current approach raises two problems. First, there will be too many retry
> PartitionRequest messages. Each downstream subtask will send PartitionRequest
> to all its upstream subtasks and the total number of messages will be O(N*N),
> where N is the parallelism of subtasks. Secondly, the interval between
> polling retries will increase the delay for upstream and downstream tasks to
> confirm PartitionRequest.
> We want to support listen and notify mechanism for PartitionRequest when the
> job needs no failover. Upstream TaskManager will add the PartitionRequest to
> a listen list with a timeout checker, and notify the request when the task
> register its partition in the TaskManager.
> [~nkubicek] I noticed that your scenario of using flink is similar to ours.
> What do you think? And hope to hear from you [~trohrmann] THX
--
This message was sent by Atlassian Jira
(v8.20.10#820010)