[
https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006134#comment-17006134
]
Piotr Nowojski edited comment on FLINK-15416 at 12/31/19 3:31 PM:
------------------------------------------------------------------
I'm not sure. I think the motivating example that you provided is not very
compelling. After all the proper fix is/was to just restart/fix the network and
setting the number of retries is pretty arbitrary (why 2? why not 3?) and still
can fail.
On the other hand, If we wanted to handle some more generic networking
failures, we would need to handle a lot more things, like connection lost,
timeouts, etc, not only connection failed, which also could impact
restarting/recovery times.
was (Author: pnowojski):
I'm not sure. I think the motivating example that you provided is not very
compelling. After all the proper fix is/was to just restart/fix the network and
setting the number of retries is pretty arbitrarily and still can fail.
On the other hand, If we wanted to handle some more generic networking
failures, we would need to handle a lot more things, like connection lost,
timeouts, etc, not only connection failed, which also could impact
restarting/recovery times.
> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
> Key: FLINK-15416
> URL: https://issues.apache.org/jira/browse/FLINK-15416
> Project: Flink
> Issue Type: Wish
> Components: Runtime / Network
> Affects Versions: 1.10.0
> Reporter: Zhenqiu Huang
> Priority: Major
>
> We run a flink with 256 TMs in production. The job internally has keyby
> logic. Thus, it builds a 256 * 256 communication channels. An outage happened
> when there is a chip internal link of one of the network switchs broken that
> connecting these machines. During the outage, the flink can't restart
> successfully as there is always an exception like "Connecting the channel
> failed: Connecting to remote task manager + '****/10.14.139.6:41300' has
> failed. This might indicate that the remote task manager has been lost.
> After deep investigation with the network infrastructure team, we found there
> are 6 switchs connecting with these machines. Each switch has 32 physcal
> links. Every socket is round-robin assigned to each of links for load
> balances. Thus, there is always average 256 * 256 / 6 * 32 * 2 = 170
> channels will be assigned to the broken link. The issue lasted for 4 hours
> until we found the broken link and restart the problematic switch.
> Given this, we found that the retry of creating channel will help to resolve
> this issue. For our networking topology, we can set retry to 2. As 170 / (132
> * 132) < 1, which means after retry twice no channel in 170 channels will be
> assigned to the broken link in the average case.
> I think it is valuable fix for this kind of partial network partition.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)