[ 
https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719651#comment-16719651
 ] 

Sergei Poganshev commented on FLINK-11127:
------------------------------------------

Here's a (slightly tricky) workaround that works:
 * set the *metrics.internal.query-service.port* property to a fixed port 
(e.g. *6666*)
 * in the taskmanager deployment:
 ** expose that fixed port in the taskmanager deployment spec
 ** use an init container to write the taskmanager's pod IP into the flink 
configuration file:
 *** define a shared _emptyDir_ volume used by *both* the flink container and 
the init container; the flink container mounts it at /etc/flink
 *** init container:
 **** add an environment variable to the init container definition that 
references *status.podIP*
 **** mount the original flink configuration folder into the init container 
(anywhere other than /etc/flink); it serves as the basis for the changes
 **** mount the shared volume at /etc/flink and copy the files from the 
configuration folder into it
 **** make the init container command append the pod IP to flink-conf.yaml as 
the value of the *taskmanager.host* property
 *** since the shared volume is mounted at /etc/flink in both containers, 
changes made by the init container are visible in the flink container

Configured this way, the jobmanager connects directly to each taskmanager's 
pod IP on the configured fixed port.
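
For illustration, the steps above could look roughly like the following 
taskmanager deployment. This is only a sketch, not my exact manifest: the 
resource names (flink-taskmanager, flink-config), the images (busybox, 
my-flink-image), the POD_IP variable name, the port name and the 
/flink-conf-template path are all placeholders, and it assumes the flink image 
reads its configuration from /etc/flink and that 
*metrics.internal.query-service.port: 6666* is already set in the 
flink-conf.yaml stored in the ConfigMap.

{code}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager              # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      volumes:
        - name: flink-conf-template    # original config (assumed to be a ConfigMap)
          configMap:
            name: flink-config         # placeholder name
        - name: flink-conf             # shared emptyDir, ends up at /etc/flink
          emptyDir: {}
      initContainers:
        - name: set-taskmanager-host
          image: busybox               # any image with a shell works
          env:
            - name: POD_IP             # placeholder variable name
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          command:
            - sh
            - -c
            - |
              # copy the original config, then append the pod IP
              cp -L /flink-conf-template/* /etc/flink/
              printf '\ntaskmanager.host: %s\n' "${POD_IP}" >> /etc/flink/flink-conf.yaml
          volumeMounts:
            - name: flink-conf-template
              mountPath: /flink-conf-template   # anywhere other than /etc/flink
            - name: flink-conf
              mountPath: /etc/flink
      containers:
        - name: taskmanager
          image: my-flink-image        # placeholder; assumed to read config from /etc/flink
          args: ["taskmanager"]
          ports:
            - name: metrics-query      # placeholder port name
              containerPort: 6666      # must match metrics.internal.query-service.port
          volumeMounts:
            - name: flink-conf
              mountPath: /etc/flink
{code}

The printf starts with a newline so the appended property lands on its own 
line even if the original flink-conf.yaml does not end with one.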

> Make metrics query service establish connection to JobManager
> -------------------------------------------------------------
>
>                 Key: FLINK-11127
>                 URL: https://issues.apache.org/jira/browse/FLINK-11127
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination, Kubernetes, Metrics
>    Affects Versions: 1.7.0
>            Reporter: Ufuk Celebi
>            Priority: Major
>
> As part of FLINK-10247, the internal metrics query service has been separated 
> into its own actor system. Before this change, the JobManager (JM) queried 
> TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a 
> separate connection to the TM metrics query service actor.
> In the context of Kubernetes, this is problematic as the JM will typically 
> *not* be able to resolve the TMs by name, resulting in warnings as follows:
> {code}
> 2018-12-11 08:32:33,962 WARN  akka.remote.ReliableDeliverySupervisor          
>               - Association with remote system 
> [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has 
> failed, address is now gated for [50] ms. Reason: [Association failed with 
> [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused 
> by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve]
> {code}
> In order to expose the TMs by name in Kubernetes, users require a service 
> *for each* TM instance, which is not practical.
> This currently results in the web UI not being able to display some basic 
> metrics about the number of sent records. You can reproduce this by following the READMEs 
> in {{flink-container/kubernetes}}.
> This worked before, because the JM is typically exposed via a service with a 
> known name and the TMs establish the connection to it which the metrics query 
> service piggybacked on.
> A potential solution to this might be to let the query service connect to the 
> JM similar to how the TMs register.
> I tagged this ticket as an improvement, but in the context of Kubernetes I 
> would consider this to be a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
