[ 
https://issues.apache.org/jira/browse/FLINK-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902150#comment-14902150
 ] 

ASF GitHub Bot commented on FLINK-2722:
---------------------------------------

Github user rmetzger commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1159#discussion_r40058785
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/net/NetUtils.java ---
    @@ -189,9 +191,17 @@ public static InetAddress findConnectingAddress(InetSocketAddress targetAddress,
                long currentSleepTime = MIN_SLEEP_TIME;
                long elapsedTime = 0;
     
    +           // before trying with different strategies: test with getLocalHost():
    +           InetAddress localhostName = InetAddress.getLocalHost();
    +
    +           if(tryToConnect(localhostName, targetAddress, AddressDetectionState.ADDRESS.getTimeout(), false)) {
    +                   LOG.debug("Using immediately InetAddress.getLocalHost() for the connecting address");
    --- End diff --
    
    These are the log statements produced at `DEBUG` level:
    ```
    16:12:19,822 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils        - Trying to select the network interface and address to use by connecting to the leading JobManager.
    16:12:19,822 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils        - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
    16:12:19,833 INFO  org.apache.flink.runtime.net.NetUtils                     - Retrieved new target address /10.240.221.7:33378.
    16:12:19,835 DEBUG org.apache.flink.runtime.net.NetUtils                     - Trying to connect to (/10.240.221.7:33378) from local address cdh544-master.c.astral-sorter-757.internal/10.240.242.143 with timeout 50
    16:12:19,838 DEBUG org.apache.flink.runtime.net.NetUtils                     - Using immediately InetAddress.getLocalHost() for the connecting address
    16:12:19,839 INFO  org.apache.flink.runtime.taskmanager.TaskManager          - TaskManager will use hostname/address 'cdh544-master.c.astral-sorter-757.internal' (10.240.242.143) for communication.
    16:12:19,839 INFO  org.apache.flink.runtime.taskmanager.TaskManager          - Starting TaskManager in streaming mode BATCH_ONLY
    16:12:19,839 INFO  org.apache.flink.runtime.taskmanager.TaskManager          - Starting TaskManager actor system at cdh544-master.c.astral-sorter-757.internal:0
    ```
    I think the messages at `INFO` level contain enough information for users 
to understand what's going on.


> Use InetAddress.getLocalHost() first when detecting TaskManager IP address
> --------------------------------------------------------------------------
>
>                 Key: FLINK-2722
>                 URL: https://issues.apache.org/jira/browse/FLINK-2722
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime, TaskManager
>    Affects Versions: 0.9, 0.10
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>             Fix For: 0.9.2
>
>
> A user reported a connection issue with Netty being unable to connect to a 
> TaskManager to subscribe to an intermediate result.
> The problem occurred when the TaskManager and JobManager were running on the 
> same host (something that can easily happen on YARN).
> In that case, the TaskManager reported a host-local IP address to the 
> JobManager when connecting, which other machines cannot reach.
> To avoid the issue in the future, the TaskManager first tries to use the 
> hostname returned by InetAddress.getLocalHost(). In a properly set-up 
> environment, this returns an address that is reachable from all 
> machines in the cluster.
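
The "try getLocalHost() first" idea from the description can be sketched roughly as follows. This is only an illustration, not the code in NetUtils: the class name `ConnectingAddressSketch`, the method `tryLocalHostFirst`, and the three-argument `tryToConnect` helper are hypothetical, and the 50 ms timeout is simply the value visible in the DEBUG log above. The sketch assumes that "reachable" means a TCP connection bound to the candidate local address can be opened to the target.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Hypothetical sketch; names and signatures are illustrative, not Flink's API.
class ConnectingAddressSketch {

    /** Returns true if a TCP connection to target succeeds when bound to the given local address. */
    public static boolean tryToConnect(InetAddress from, InetSocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Bind to the candidate local address; port 0 lets the OS pick a free port.
            socket.bind(new InetSocketAddress(from, 0));
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            // Bind failure, refusal, or timeout: this local address is not usable for the target.
            return false;
        }
    }

    /** Returns InetAddress.getLocalHost() if it can reach target, or null so the caller can fall back. */
    public static InetAddress tryLocalHostFirst(InetSocketAddress target, int timeoutMillis) {
        try {
            InetAddress localHost = InetAddress.getLocalHost();
            return tryToConnect(localHost, target, timeoutMillis) ? localHost : null;
        } catch (UnknownHostException e) {
            // The local hostname could not be resolved; fall back to other strategies.
            return null;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: ConnectingAddressSketch <targetHost> <targetPort>");
            return;
        }
        InetSocketAddress target = new InetSocketAddress(args[0], Integer.parseInt(args[1]));
        InetAddress chosen = tryLocalHostFirst(target, 50);
        System.out.println(chosen != null
                ? "Using InetAddress.getLocalHost(): " + chosen
                : "getLocalHost() not usable, falling back to other strategies");
    }
}
```

Returning null rather than throwing keeps the fallback to the existing detection strategies in the caller's hands, which mirrors the intent of the change: the new check runs before the other strategies and does not replace them.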



