[
https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Khachatryan updated FLINK-15416:
--------------------------------------
Release Note: new configuration parameter: taskmanager.network.retries
> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
> Key: FLINK-15416
> URL: https://issues.apache.org/jira/browse/FLINK-15416
> Project: Flink
> Issue Type: Wish
> Components: Runtime / Network
> Affects Versions: 1.10.0
> Reporter: Zhenqiu Huang
> Assignee: Zhenqiu Huang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We run a flink with 256 TMs in production. The job internally has keyby
> logic. Thus, it builds a 256 * 256 communication channels. An outage happened
> when there is a chip internal link of one of the network switchs broken that
> connecting these machines. During the outage, the flink can't restart
> successfully as there is always an exception like "Connecting the channel
> failed: Connecting to remote task manager + '****/10.14.139.6:41300' has
> failed. This might indicate that the remote task manager has been lost.
> After deep investigation with the network infrastructure team, we found there
> are 6 switchs connecting with these machines. Each switch has 32 physcal
> links. Every socket is round-robin assigned to each of links for load
> balances. Thus, there is always average 256 * 256 / 6 * 32 * 2 = 170
> channels will be assigned to the broken link. The issue lasted for 4 hours
> until we found the broken link and restart the problematic switch.
> Given this, we found that the retry of creating channel will help to resolve
> this issue. For our networking topology, we can set retry to 2. As 170 / (132
> * 132) < 1, which means after retry twice no channel in 170 channels will be
> assigned to the broken link in the average case.
> I think it is valuable fix for this kind of partial network partition.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)