[ 
https://issues.apache.org/jira/browse/FLINK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-1608:
--------------------------------
    Description: 
The TaskManager uses a NetUtils routine to pick a network interface that lets 
it talk to the JobManager. However, if the JobManager is not online yet, the 
TaskManager falls back to an arbitrary non-localhost device.

In cases where the TaskManagers start faster than the JobManager, they may pick 
the wrong interface (and associated address and hostname).

The later logic (that tries to connect to the JobManager actor) retries several 
times. I think we need similar retry logic when looking for a suitable network 
interface to use.

  was:
The taskmanagers use a NetUtils routine to find an interface that lets them 
talk to the Jobmanager. However, if the JobManager is not online yet, they fall 
back to some non-localhost device.

In cases where the TaskManagers start faster than the JobManager, they pick the 
wrong hostname and interface.

The later logic (that tries to connect to the JobManager actor) has a logic 
with retries. I think we need a similar logic here...


> TaskManagers may pick wrong network interface when starting before JobManager
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-1608
>                 URL: https://issues.apache.org/jira/browse/FLINK-1608
>             Project: Flink
>          Issue Type: Bug
>          Components: TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>             Fix For: 0.9
>
>
> The TaskManager uses a NetUtils routine to pick a network interface that lets 
> it talk to the JobManager. However, if the JobManager is not online yet, the 
> TaskManager falls back to an arbitrary non-localhost device.
> In cases where the TaskManagers start faster than the JobManager, they may 
> pick the wrong interface (and associated address and hostname).
> The later logic (that tries to connect to the JobManager actor) retries 
> several times. I think we need similar retry logic when looking for a 
> suitable network interface to use.
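
The retry idea proposed in the description could look roughly like the sketch
below. It is not the NetUtils code and not the eventual fix. The class name,
method names, parameters, and the host/port defaults in main are all
hypothetical. It probes every local address by trying to open a TCP connection
to the JobManager, and repeats the whole probe a bounded number of times
before giving up, so a fallback to an arbitrary device would only happen
after the retries are exhausted.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Optional;

// Hypothetical sketch; none of these names come from Flink.
public class InterfaceLookupSketch {

    /** One probe round: returns the first local address that can reach the target. */
    static Optional<InetAddress> tryOnce(InetSocketAddress target, int timeoutMillis) {
        try {
            Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
            if (nifs == null) {
                return Optional.empty();
            }
            for (NetworkInterface nif : Collections.list(nifs)) {
                for (InetAddress local : Collections.list(nif.getInetAddresses())) {
                    try (Socket socket = new Socket()) {
                        // Bind to this local address, then try to reach the target from it.
                        socket.bind(new InetSocketAddress(local, 0));
                        socket.connect(target, timeoutMillis);
                        return Optional.of(local);
                    } catch (IOException e) {
                        // Target not reachable from this address; try the next one.
                    }
                }
            }
        } catch (IOException e) {
            // Enumerating the interfaces failed; treat this round as unsuccessful.
        }
        return Optional.empty();
    }

    /** Repeats the probe up to maxAttempts times, pausing between rounds. */
    static Optional<InetAddress> findWithRetries(
            InetSocketAddress target, int maxAttempts, long pauseMillis, int timeoutMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<InetAddress> found = tryOnce(target, timeoutMillis);
            if (found.isPresent()) {
                return found;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        // Caller decides what to do now, e.g. fall back to a non-localhost device.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed defaults for illustration only.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6123;
        Optional<InetAddress> address =
                findWithRetries(new InetSocketAddress(host, port), 3, 500L, 1000);
        System.out.println(address.isPresent()
                ? "Reached JobManager from " + address.get()
                : "JobManager not reachable after retries");
    }
}
```

The number of attempts and the pause between them are the knobs that decide
how long a TaskManager waits for a slow JobManager before it falls back.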



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
