[ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-22643:
-----------------------------------
Labels: auto-deprioritized-major (was: stale-major)
Priority: Minor (was: Major)
This issue was labeled "stale-major" 7 days ago and has not received any
updates so it is being deprioritized. If this ticket is actually Major, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> Too many TCP connections among TaskManagers for large scale jobs
> ----------------------------------------------------------------
>
> Key: FLINK-22643
> URL: https://issues.apache.org/jira/browse/FLINK-22643
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Affects Versions: 1.13.0
> Reporter: Zhilong Hong
> Priority: Minor
> Labels: auto-deprioritized-major
> Fix For: 1.14.0
>
>
> For large-scale jobs, there can be too many TCP connections among
> TaskManagers. Consider the following example.
> Take a streaming job with 20 JobVertices, each with a parallelism of 500.
> We divide the vertices into 5 slot sharing groups, and each TaskManager has 5
> slots, so the job uses 400 TaskManagers. Assume that the job runs on a
> cluster with 20 machines.
> If all the job edges are all-to-all edges, each machine will hold
> 19 * 20 * 399 * 2 = 303,240 TCP connections. If we run several jobs on this
> cluster, the number of TCP connections may exceed the Linux maximum of
> 1,048,576. This stops the TaskManagers from creating new TCP connections
> and causes task failovers.
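
The arithmetic above can be reproduced with a short sketch. The function and
parameter names are hypothetical. The model assumes what the formula appears
to imply: the 20 vertices are linked by 19 all-to-all job edges, the 400
TaskManagers are spread evenly over the 20 machines, each TaskManager holds
one connection per job edge to every other TaskManager, and the factor 2
counts each connection once per direction.

```python
def connections_per_machine(job_edges, task_managers, machines, directions=2):
    """Estimate TCP connections on one machine (illustrative model only)."""
    # Assumption: TaskManagers are distributed evenly across machines.
    tms_per_machine = task_managers // machines
    # Every TaskManager talks to all other TaskManagers of the job.
    remote_peers = task_managers - 1
    return job_edges * tms_per_machine * remote_peers * directions


per_machine = connections_per_machine(job_edges=19, task_managers=400, machines=20)
print(per_machine)  # 303240

# Limit quoted in the issue description.
LINUX_LIMIT = 1_048_576
# Number of such jobs that fit under the limit on one machine.
print(LINUX_LIMIT // per_machine)  # 3
```

Under these assumptions, a fourth job of the same shape on the same cluster
would push a machine past the limit.
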
> We run our production jobs on a K8S cluster, and they keep failing over due
> to network-related exceptions such as {{Sending the partition request to
> 'null' failed}}.
> We think that we can decrease the number of connections by letting tasks
> reuse the same connection. We implemented a POC that makes all tasks on the
> same TaskManager reuse one TCP connection. For the example job above, the
> number of connections per machine decreases from 303,240 to 15,960. With
> the POC, the frequency of network-related exceptions in our production jobs
> drops significantly.
> The POC is illustrated in:
> https://github.com/wsry/flink/commit/bf1c09e80450f40d018a1d1d4fe3dfd2de777fdc
>
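
The effect of connection reuse on the example job can be illustrated with a
toy model. This is not the POC's code: the `ConnectionPool` class, its cache
keys, and the helper function are hypothetical, chosen only to show how
caching one connection per remote TaskManager, instead of one per remote
TaskManager and job edge, changes the count. It assumes 19 all-to-all job
edges, 400 TaskManagers spread evenly over 20 machines, and a factor of 2 for
the two directions.

```python
class ConnectionPool:
    """Toy stand-in for a per-TaskManager connection cache (not Flink code)."""

    def __init__(self, reuse_per_peer):
        self.reuse_per_peer = reuse_per_peer
        self.connections = {}

    def get(self, remote_tm, job_edge):
        # With reuse, the job edge is ignored and all tasks share one
        # connection to the same remote TaskManager.
        key = remote_tm if self.reuse_per_peer else (remote_tm, job_edge)
        if key not in self.connections:
            self.connections[key] = object()  # placeholder for a TCP channel
        return self.connections[key]


def outgoing_connections(reuse_per_peer, task_managers=400, job_edges=19):
    """Count connections one TaskManager opens to all of its peers."""
    pool = ConnectionPool(reuse_per_peer)
    for remote_tm in range(1, task_managers):  # TaskManager 0 is the local one
        for job_edge in range(job_edges):
            pool.get(remote_tm, job_edge)
    return len(pool.connections)


tms_per_machine = 400 // 20
without_reuse = outgoing_connections(False) * tms_per_machine * 2
with_reuse = outgoing_connections(True) * tms_per_machine * 2
print(without_reuse, with_reuse)  # 303240 15960
```

In this model the reduction factor equals the number of job edges, which
matches the drop from 303,240 to 15,960 quoted above.
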