[
https://issues.apache.org/jira/browse/FLINK-15911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048809#comment-17048809
]
Xintong Song commented on FLINK-15911:
--------------------------------------
Some updates on this ticket.
I've decoupled bind-address/bind-port from the address/port of the JM/TM RPC
services and verified this successfully with a docker-based e2e test with the
default parallelism of 1. However, I ran into problems when increasing the
parallelism to have multiple TMs, because the TMs failed to find each other's
Netty shuffle address/port.
I talked to [~zjwang] offline. He confirmed that the Netty shuffle service
uses the TM address in two ways:
* The address passed into NettyShuffleEnvironment is used for binding to the
local address. It should use the bind-address.
* The address wrapped in TaskManagerLocation is sent to the JobMaster and is
then used by tasks to access the TM's shuffle service.
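To illustrate the distinction between the two uses, here is a minimal,
language-neutral sketch in Python. It is not Flink code: the names Endpoint and
start_service, as well as the host name and port, are hypothetical and chosen
only for illustration. The service listens on the bind endpoint, while a
separate external endpoint is what gets handed to peers.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A host/port pair (hypothetical helper, not a Flink class)."""
    host: str
    port: int


def start_service(bind: Endpoint, external: Endpoint):
    """Listen on the bind endpoint and return the endpoint to advertise.

    The external endpoint is passed through untouched. It is never
    resolved locally, because it may only be resolvable from outside
    the NAT or container network.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((bind.host, bind.port))  # local address inside the NAT
    server.listen()
    return server, external              # address that peers should dial


if __name__ == "__main__":
    # Port 0 lets the OS pick a free local port for this demo.
    server, advertised = start_service(
        bind=Endpoint("127.0.0.1", 0),
        external=Endpoint("tm-1.example.com", 30100),
    )
    print("listening on", server.getsockname())
    print("advertising", advertised)
    server.close()
```

In this sketch, the failure described above would correspond to advertising the
bind endpoint (or binding to the external one) instead of keeping the two
separate.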
I will continue trying to resolve the address/port problem of the Netty
shuffle service.
In addition, the address/port and bind-address/bind-port of the following
services may also need to be separated. I would like to exclude them from the
scope of this ticket, to keep the set of changes needed for getting Flink to
work over NAT to a minimum.
* Blob Server on JM. This is only needed if we want to submit jobs from
outside of NAT to a Flink session cluster whose JM runs behind NAT. I will try
to address this in FLINK-15154.
* KvStateService on TM. This is only used for queryable state, and I'm not
sure how many use cases we have for it. Also, I'm not familiar with how the
KvStateService works. If we want to get it to work over NAT, I would need help
from someone familiar with it.
> Flink does not work over NAT
> ----------------------------
>
> Key: FLINK-15911
> URL: https://issues.apache.org/jira/browse/FLINK-15911
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Till Rohrmann
> Assignee: Xintong Song
> Priority: Blocker
> Labels: usability
> Fix For: 1.11.0
>
>
> Currently, it is not possible to run Flink over network address translation.
> The problem is that the Flink processes do not allow to specify separate bind
> and external ports. Moreover, the {{TaskManager}} tries to resolve the given
> {{taskmanager.host}} which might not be resolvable. This breaks NAT or docker
> setups where the external address is not resolvable from within the
> container/internal network.