[
https://issues.apache.org/jira/browse/FLINK-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008557#comment-15008557
]
ASF GitHub Bot commented on FLINK-2967:
---------------------------------------
Github user StephanEwen commented on a diff in the pull request:
https://github.com/apache/flink/pull/1361#discussion_r45052228
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/net/ConnectionUtils.java ---
@@ -180,7 +180,41 @@ public static InetAddress findConnectingAddress(InetSocketAddress targetAddress,
 		}
 	}
+	/**
+	 * This utility method tries to connect to the JobManager using the InetAddress returned by
+	 * InetAddress.getLocalHost(). The purpose of the utility is to have a final try connecting to
+	 * the target address using the LocalHost before using the address returned.
+	 * We do a second try because the JM might have been unavailable during the first check.
+	 *
+	 * @param preliminaryResult The address detected by the heuristic
+	 * @return either the preliminaryResult or the address returned by InetAddress.getLocalHost() (if
+	 * we are able to connect to targetAddress from there)
+	 */
+	private static InetAddress tryLocalHostBeforeReturning(InetAddress preliminaryResult, SocketAddress targetAddress, boolean logging) throws IOException {
+		InetAddress localhostName = InetAddress.getLocalHost();
+		if(tryToConnect(localhostName, targetAddress, AddressDetectionState.LOCAL_HOST.getTimeout(), logging)) {
--- End diff --
Also, code style: missing space after "if".
> TM address detection might not always detect the right interface on slow
> networks / overloaded JMs
> --------------------------------------------------------------------------------------------------
>
> Key: FLINK-2967
> URL: https://issues.apache.org/jira/browse/FLINK-2967
> Project: Flink
> Issue Type: Bug
> Affects Versions: 0.9, 0.10.0, 1.0.0
> Reporter: Robert Metzger
> Assignee: Robert Metzger
>
> I'm talking to a user who is facing the following issue:
> Some of the TaskManagers select the wrong IP address out of the available
> network interfaces.
> The first address we try to connect to is the one returned by
> {{InetAddress.getLocalHost()}}. This address is the right IP address to use,
> but the JobManager is not able to respond within the timeout (50ms) to that
> connection request.
> So the TM tries the next address, which is not publicly reachable. However,
> the TM can connect to the JM from there. Netty will later fail to connect to
> the TM from the other TMs.
> There are two solutions for this issue:
> - Allow users to configure a higher timeout for the first address detection
> strategy. In most cases, the address returned by
> {{InetAddress.getLocalHost()}} is correct. By setting a high timeout, users
> with slow networks / overloaded JMs can make sure the TM picks this address.
> - Add an Akka message which we send from the TM to the JM, and the JM tries
> to connect to the TM. If that succeeds, we know that the TM is reachable from
> the outside.
> The problem is that we have to start a separate actor system on the
> TaskManager first. We have to do this because we might use a wrong IP address
> for the TM (so we might end up starting actor systems until we find an
> externally reachable IP).
> I'm first going to implement the first approach. If that solution works well
> for my user, I'll contribute this to 0.10 / 1.0.
> If not, I'll implement the second approach.
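
To illustrate the first approach from the issue description, here is a minimal sketch of probing a target from a given local address with a caller-supplied timeout. It is not the code from the pull request: the class LocalHostProbe, the methods canConnect and pickAddress, and the timeoutMillis parameter are hypothetical names chosen for this example, and it uses only plain java.net sockets.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical helper; names are illustrative and not part of Flink. */
public class LocalHostProbe {

    /**
     * Returns true if a TCP connection from the local address "from" to
     * "target" can be established within timeoutMillis. Pass a positive
     * value: Socket.connect treats a timeout of 0 as "wait indefinitely".
     */
    public static boolean canConnect(InetAddress from, SocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Bind to the candidate local address (port 0 = any free port)
            // so the connection attempt leaves through that interface.
            socket.bind(new InetSocketAddress(from, 0));
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            // Timeout, refused connection, or unroutable address.
            return false;
        }
    }

    /**
     * Prefers "preferred" (for example the result of InetAddress.getLocalHost())
     * if the target is reachable from it within the timeout; otherwise keeps
     * the fallback address found by some other heuristic.
     */
    public static InetAddress pickAddress(InetAddress preferred, InetAddress fallback,
                                          SocketAddress target, int timeoutMillis) {
        return canConnect(preferred, target, timeoutMillis) ? preferred : fallback;
    }
}
```

With a helper like this, a larger timeoutMillis for the getLocalHost() attempt gives a slow network or overloaded JM more time to answer before the fallback address is used.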