[ 
https://issues.apache.org/jira/browse/FLINK-15911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048809#comment-17048809
 ] 

Xintong Song commented on FLINK-15911:
--------------------------------------

Some updates on this ticket.

I've decoupled bind-address/bind-port from the address/port of the JM/TM RPC 
services and verified this successfully with a docker-based e2e test using the 
default parallelism of 1. However, I ran into problems when increasing the 
parallelism to have multiple TMs, because the TMs failed to find each other's 
Netty shuffle address/port.

I talked to [~zjwang] offline. He confirmed that the Netty shuffle service uses 
the TM address in two ways:
 * The address passed into NettyShuffleEnvironment is used for binding to the 
local address. It should use the bind-address.
 * The address wrapped in TaskManagerLocation is sent to the JobMaster and is 
then used by tasks to access the TM's shuffle service.
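
To make the distinction between the two addresses concrete, here is a minimal 
sketch in plain Java. It is not Flink code: the class ShuffleEndpointSketch, its 
methods, and the host names and ports in the usage note are all hypothetical. 
It only shows one socket being bound to a local address while a different, 
possibly locally unresolvable, address is handed out to peers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration only; none of these names are Flink classes.
final class ShuffleEndpointSketch implements AutoCloseable {
    private final ServerSocket server;  // bound to the local (bind) address
    private final String externalHost;  // address advertised to remote peers
    private final int externalPort;

    ShuffleEndpointSketch(String bindHost, int bindPort,
                          String externalHost, int externalPort) {
        try {
            this.server = new ServerSocket();
            // Bind only to the address that exists inside the container/private network.
            this.server.bind(new InetSocketAddress(InetAddress.getByName(bindHost), bindPort));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.externalHost = externalHost;
        // Assumption of this sketch: external port 0 means "same as the bound port".
        this.externalPort = externalPort != 0 ? externalPort : server.getLocalPort();
    }

    /** The address this process actually listens on. */
    InetSocketAddress boundAddress() {
        return (InetSocketAddress) server.getLocalSocketAddress();
    }

    /**
     * The address to send to peers. It is created unresolved on purpose,
     * because the external host may not be resolvable from behind the NAT.
     */
    InetSocketAddress advertisedAddress() {
        return InetSocketAddress.createUnresolved(externalHost, externalPort);
    }

    @Override
    public void close() {
        try {
            server.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, a TM behind NAT might be created as 
new ShuffleEndpointSketch("0.0.0.0", 6121, "tm-1.example.com", 30121): it listens 
on port 6121 locally, while peers are told to connect to tm-1.example.com:30121.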

I will continue trying to resolve the address/port problem of the Netty shuffle 
service.

In addition, the address/port and bind-address/bind-port of the following 
services may also need to be separated. I would like to exclude them from the 
scope of this ticket, to keep the changes here to the minimum needed for 
getting Flink to work over NAT.
 * Blob Server on JM. This is only needed if we want to submit jobs from 
outside the NAT to a Flink session cluster whose JM runs behind NAT. I will try 
to address this in FLINK-15154.
 * KvStateService on TM. This is only used for queryable state, and I'm not 
sure how many use cases we have for it. Also, I'm not familiar with how the 
KvStateService works. If we want to get it to work over NAT, I would need help 
from someone familiar with it.

> Flink does not work over NAT
> ----------------------------
>
>                 Key: FLINK-15911
>                 URL: https://issues.apache.org/jira/browse/FLINK-15911
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Till Rohrmann
>            Assignee: Xintong Song
>            Priority: Blocker
>              Labels: usability
>             Fix For: 1.11.0
>
>
> Currently, it is not possible to run Flink over network address translation. 
> The problem is that the Flink processes do not allow to specify separate bind 
> and external ports. Moreover, the {{TaskManager}} tries to resolve the given 
> {{taskmanager.host}} which might not be resolvable. This breaks NAT or docker 
> setups where the external address is not resolvable from within the 
> container/internal network.



