Alex created FLINK-11632:
----------------------------
Summary: Make TaskManager automatic bind address picking more
explicit (by default) and more configurable
Key: FLINK-11632
URL: https://issues.apache.org/jira/browse/FLINK-11632
Project: Flink
Issue Type: Improvement
Components: Distributed Coordination, Network, TaskManager
Reporter: Alex
Currently, the optional {{taskmanager.host}} configuration option in
{{flink-conf.yaml}} lets Flink users "statically" predefine the bind address
that the TaskManager listens on (note: this option can also be overridden by
passing the corresponding command-line option to Flink).
When the option is not set, the TaskManager tries to [heuristically pick
a bind
address|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L421-L442].
The resulting address (hostname) is used to advertise the different service
endpoints running in the TM to the JobManager. It is also resolved later to an
{{[InetAddress|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L359]}}
that is used as the bind address for communication between TM nodes.
This proposal is to minimize usage of heuristics (by default) by introducing a
new configuration option (for example, {{taskmanager.host.bind-policy}}) with
possible values:
* {{"hostname"}} - default, use TM's host's name ({{==
InetAddress.getLocalHost().getHostName()}});
* {{"ip"}} - use TM's host's ip address ({{==
InetAddress.getLocalHost().getHostAddress()}});
* {{"auto-detect-hostname"}} - use the heuristics based detection mechanism.
*Note:* the configuration key and values could be named better; suggestions
are welcome.
*Note 2:* in the future, the configuration option _may_ need to be extended
to allow choosing a specific network interface, or a preference for IPv6 vs
IPv4.
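
To make the intended semantics concrete, here is a minimal Java sketch of how the three policy values might be resolved. It is only an illustration of the proposal, not existing Flink code: the {{BindPolicy}} enum, {{fromConfig}}, {{resolveHost}} and the {{heuristic}} supplier are all hypothetical names, and the configuration key itself is still open for discussion.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Supplier;

class BindPolicySketch {

    /** Hypothetical policy values mirroring the proposal; names are not final. */
    enum BindPolicy {
        HOSTNAME("hostname"),
        IP("ip"),
        AUTO_DETECT_HOSTNAME("auto-detect-hostname");

        final String configValue;

        BindPolicy(String configValue) {
            this.configValue = configValue;
        }

        /** Parses the configured value; an unset value falls back to the proposed default. */
        static BindPolicy fromConfig(String value) {
            if (value == null || value.trim().isEmpty()) {
                return HOSTNAME;
            }
            for (BindPolicy policy : values()) {
                if (policy.configValue.equalsIgnoreCase(value.trim())) {
                    return policy;
                }
            }
            throw new IllegalArgumentException("Unknown bind policy: " + value);
        }
    }

    /**
     * Resolves the host a TM would advertise. An explicitly configured host
     * (the existing taskmanager.host option) always wins; otherwise the policy decides.
     * The heuristic supplier stands in for the existing probe-based detection.
     */
    static String resolveHost(String configuredHost, BindPolicy policy, Supplier<String> heuristic) {
        if (configuredHost != null && !configuredHost.isEmpty()) {
            return configuredHost;
        }
        try {
            switch (policy) {
                case HOSTNAME:
                    return InetAddress.getLocalHost().getHostName();
                case IP:
                    return InetAddress.getLocalHost().getHostAddress();
                case AUTO_DETECT_HOSTNAME:
                    return heuristic.get();
                default:
                    throw new IllegalStateException("Unhandled policy: " + policy);
            }
        } catch (UnknownHostException e) {
            // The local host name could not be resolved to an address.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BindPolicy policy = BindPolicy.fromConfig(args.length > 0 ? args[0] : null);
        System.out.println(policy + " -> " + resolveHost(null, policy, () -> "heuristic-result"));
    }
}
```

In this sketch an explicit {{taskmanager.host}} still takes precedence over any policy, which matches the current behaviour where the heuristics only run when the option is unset.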
h3. Rationale
[The heuristics
mechanism|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/net/ConnectionUtils.java#L364-L475]
tries to establish a probe connection to {{jobmanager.rpc.address}} from
different network interface addresses.
In parallel setups (when the JM and multiple TMs start simultaneously), the
outcome depends on timing and on the assigned network IP addresses, and may end
up with "non-uniform" address bindings across TMs: some may be "lucky" enough
to pick up a non-default network interface, while others fall back to
{{InetAddress.getLocalHost().getHostName()}}. In the end, it is less obvious
and transparent which bind address a TM picks up.
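
For readers unfamiliar with the probing approach, the following Java sketch shows the general idea of trying a connection to a target from each local interface address. It is a simplified illustration and not the actual {{ConnectionUtils}} implementation; the class and method names ({{ProbeSketch}}, {{canConnectFrom}}, {{firstAddressReaching}}) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.net.SocketException;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Optional;

class ProbeSketch {

    /** Returns true if a TCP connection to target can be opened from the given local address. */
    static boolean canConnectFrom(InetAddress local, InetSocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.bind(new InetSocketAddress(local, 0)); // port 0: any free local port
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            // Target not reachable from this address, or not (yet) listening.
            return false;
        }
    }

    /** Walks all local interface addresses and returns the first one that reaches the target. */
    static Optional<InetAddress> firstAddressReaching(InetSocketAddress target, int timeoutMillis) {
        Enumeration<NetworkInterface> interfaces;
        try {
            interfaces = NetworkInterface.getNetworkInterfaces();
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        if (interfaces == null) {
            return Optional.empty();
        }
        for (NetworkInterface nif : Collections.list(interfaces)) {
            for (InetAddress address : Collections.list(nif.getInetAddresses())) {
                if (canConnectFrom(address, target, timeoutMillis)) {
                    return Optional.of(address);
                }
            }
        }
        return Optional.empty();
    }
}
```

The timing dependence is visible here: if the target is not yet listening when a TM probes, every attempt fails and the caller has to fall back to something else, so TMs started at slightly different moments can end up with different answers.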
In practice, it's possible that in the majority of cases (in well set-up
environments) the heuristics mechanism returns a result that matches
{{InetAddress.getLocalHost()}}. The proposal is to stick with this simpler
and more explicit binding (by default), avoiding the non-determinism of the
heuristics.
The old mechanism remains available in case it is useful in some setups,
but it would require an explicit configuration setting.
Additionally, this proposal extends the "auto configuration" option by allowing
users to choose the host's IP address (instead of the hostname). This may be
convenient in situations where the TMs' machines are not necessarily reachable
via DNS (for example, in a Kubernetes setup).