[ 
https://issues.apache.org/jira/browse/FLINK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464225#comment-17464225
 ] 

fanrui commented on FLINK-25417:
--------------------------------

Yeah, [~Thesharing] thanks for your reminder. I will watch FLINK-22643.

> Too many connections for TM
> ---------------------------
>
>                 Key: FLINK-25417
>                 URL: https://issues.apache.org/jira/browse/FLINK-25417
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.15.0, 1.13.5, 1.14.2
>            Reporter: fanrui
>            Priority: Major
>         Attachments: image-2021-12-22-19-17-59-486.png, 
> image-2021-12-22-19-18-23-138.png
>
>
> Hi masters, when the number of tasks exceeds 10, some TMs have more than 4000 
> TCP connections.
> !image-2021-12-22-19-17-59-486.png|width=1388,height=307!
>  
> h2. Reason:
> When a task is initialized, each downstream InputChannel connects to the 
> upstream ResultPartition.
> PartitionRequestClientFactory#createPartitionRequestClient keeps a cache, 
> {_}ConcurrentMap<ConnectionID, CompletableFuture<NettyPartitionRequestClient>> clients{_}, 
> to avoid opening repeated TCP connections. However, ConnectionID has a 
> connectionIndex field, so connections to the same address are not shared 
> when their connectionIndex differs.
> The connectionIndex comes from the IntermediateResult and is a random number. 
> When multiple tasks run in one TM, other TMs must establish multiple 
> connections to that TM, and each task has its own NettyPartitionRequestClient.
> Assume that the parallelism of the Flink job is 100, each TM runs 20 tasks, 
> and the partitioning strategy between tasks is rebalance or hash. Then the 
> number of connections for a single TM is (20-1) * 100 * 2 = 3800. If multiple 
> such TMs run on a single node, the connection count on that node becomes a risk.
>  
> I want to know whether it is risky to change the cache key to 
> connectionID.address, so that a single TCP connection is shared by all tasks 
> of a TM.
> I think it is feasible because:
>  # I have tested it and the tasks run normally.
>  # Each NettyMessage contains the InputChannelID, which identifies the 
> channel that the message belongs to.
>  
> !image-2021-12-22-19-18-23-138.png|width=2953,height=686!
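
To illustrate the cache-key question in the quoted description, here is a minimal, self-contained sketch. It is a toy model, not Flink's actual code: `ConnectionCacheSketch`, `ConnectionId`, `sampleRequests`, `countConnections`, and the `tm-N:6121` addresses are all hypothetical names made up for this example. It only shows how the number of cached entries (one per TCP connection) changes when the key is the full ID versus the address alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ConnectionCacheSketch {

    /** Hypothetical stand-in for ConnectionID: remote address plus connectionIndex. */
    static final class ConnectionId {
        final String address;
        final int connectionIndex;

        ConnectionId(String address, int connectionIndex) {
            this.address = address;
            this.connectionIndex = connectionIndex;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ConnectionId)) {
                return false;
            }
            ConnectionId other = (ConnectionId) o;
            return address.equals(other.address) && connectionIndex == other.connectionIndex;
        }

        @Override
        public int hashCode() {
            return Objects.hash(address, connectionIndex);
        }
    }

    /** Builds requests to `remotes` addresses, each with `indexes` distinct connectionIndex values. */
    static List<ConnectionId> sampleRequests(int remotes, int indexes) {
        List<ConnectionId> requests = new ArrayList<>();
        for (int r = 1; r <= remotes; r++) {
            for (int i = 0; i < indexes; i++) {
                ConnectionId id = new ConnectionId("tm-" + r + ":6121", i);
                requests.add(id);
                requests.add(id); // a repeated request should hit the cache
            }
        }
        return requests;
    }

    /** Counts cache entries (one per connection) created under the given cache key. */
    static <K> int countConnections(List<ConnectionId> requests, Function<ConnectionId, K> keyFn) {
        Map<K, Boolean> clients = new ConcurrentHashMap<>();
        for (ConnectionId id : requests) {
            clients.putIfAbsent(keyFn.apply(id), Boolean.TRUE);
        }
        return clients.size();
    }

    public static void main(String[] args) {
        List<ConnectionId> requests = sampleRequests(3, 4);
        // Keyed by the full ID: one connection per (address, connectionIndex) pair.
        System.out.println("keyed by full ID: " + countConnections(requests, id -> id));
        // Keyed by address only: one connection per remote address.
        System.out.println("keyed by address: " + countConnections(requests, id -> id.address));
    }
}
```

With 3 remote addresses and 4 index values each, the full-ID key yields 12 entries and the address-only key yields 3. Whether sharing one connection is safe in the real network stack is exactly the open question in the issue.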


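The second argument in the quoted list (each message carries an InputChannelID) can be sketched in the same spirit. Again this is only a toy model under stated assumptions: `SharedConnectionDemuxSketch`, `Message`, `register`, `onMessage`, and `received` are hypothetical names, and channel IDs are plain strings here. It shows how messages arriving on one shared connection could be routed to the right channel by their ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SharedConnectionDemuxSketch {

    /** Hypothetical message: the ID of the target channel plus a payload. */
    static final class Message {
        final String channelId;
        final String payload;

        Message(String channelId, String payload) {
            this.channelId = channelId;
            this.payload = payload;
        }
    }

    // One buffer per registered channel, all fed by the same (shared) connection.
    private final Map<String, List<String>> channels = new HashMap<>();

    /** Registers a channel so that messages addressed to it can be delivered. */
    void register(String channelId) {
        channels.put(channelId, new ArrayList<>());
    }

    /** Routes a message from the shared connection to its channel by ID. */
    void onMessage(Message message) {
        List<String> buffer = channels.get(message.channelId);
        if (buffer == null) {
            throw new IllegalStateException("Unknown channel: " + message.channelId);
        }
        buffer.add(message.payload);
    }

    /** Returns the payloads delivered so far to the given channel. */
    List<String> received(String channelId) {
        return channels.get(channelId);
    }

    public static void main(String[] args) {
        SharedConnectionDemuxSketch connection = new SharedConnectionDemuxSketch();
        connection.register("channel-a");
        connection.register("channel-b");
        connection.onMessage(new Message("channel-a", "1"));
        connection.onMessage(new Message("channel-b", "2"));
        connection.onMessage(new Message("channel-a", "3"));
        System.out.println("channel-a: " + connection.received("channel-a"));
        System.out.println("channel-b: " + connection.received("channel-b"));
    }
}
```

The sketch ignores ordering across channels, backpressure, and error handling, which are the aspects most likely to matter when many tasks share one connection.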

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
