[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719651#comment-16719651 ]
Sergei Poganshev commented on FLINK-11127:
------------------------------------------

Here's a (slightly tricky) workaround that works:
 * configure the *metrics.internal.query-service.port* property to some fixed port (e.g. *6666*)
 * in the taskmanager deployment:
 ** expose that fixed port in the taskmanager deployment spec
 ** use an init container to put the taskmanager's pod IP into the flink configuration file:
 *** define a shared _emptyDir_ volume for *both* the flink container and the init container; the flink container mounts it at /etc/flink
 *** init container:
 **** add an environment variable to the init container definition that references *status.podIP*
 **** mount the flink configuration folder into the init container (anywhere other than /etc/flink) - this will be used as a basis for changes
 **** copy the files from this folder to /etc/flink
 **** make the init container command append the pod IP to flink-conf.yaml as the value of the *taskmanager.host* property
 *** since /etc/flink is mounted into both containers, changes made in the init container will be visible in the flink container

Configured this way, the jobmanager will connect directly to the taskmanagers' IPs on the configured fixed port.

> Make metrics query service establish connection to JobManager
> -------------------------------------------------------------
>
>                 Key: FLINK-11127
>                 URL: https://issues.apache.org/jira/browse/FLINK-11127
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination, Kubernetes, Metrics
>    Affects Versions: 1.7.0
>            Reporter: Ufuk Celebi
>            Priority: Major
>
> As part of FLINK-10247, the internal metrics query service has been separated
> into its own actor system. Before this change, the JobManager (JM) queried
> TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a
> separate connection to the TM metrics query service actor.
> In the context of Kubernetes, this is problematic as the JM will typically
> *not* be able to resolve the TMs by name, resulting in warnings as follows:
> {code}
> 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor
> - Association with remote system
> [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has
> failed, address is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused
> by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve]
> {code}
> In order to expose the TMs by name in Kubernetes, users require a service
> *for each* TM instance which is not practical.
> This currently results in the web UI not being able to display some basic
> metrics about the number of sent records. You can reproduce this by following
> the READMEs in {{flink-container/kubernetes}}.
> This worked before, because the JM is typically exposed via a service with a
> known name and the TMs establish the connection to it which the metrics query
> service piggybacked on.
> A potential solution to this might be to let the query service connect to the
> JM similar to how the TMs register.
> I tagged this ticket as an improvement, but in the context of Kubernetes I
> would consider this to be a bug.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
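
For reference, the workaround from the comment above could look roughly like the Deployment fragment below. This is only a sketch under several assumptions that are not part of the original comment: the resource names, labels, the `busybox` init image, the placeholder flink image, the `flink-config` ConfigMap (assumed to already contain a flink-conf.yaml with `metrics.internal.query-service.port: 6666`) and the `/flink-conf-base` mount path are all hypothetical. It also assumes the flink image reads its configuration from /etc/flink, as in the comment.

```yaml
# Sketch only: names, images and the ConfigMap are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      volumes:
        # Base configuration, assumed to set
        # metrics.internal.query-service.port: 6666 in flink-conf.yaml.
        - name: flink-conf-base
          configMap:
            name: flink-config
        # Shared writable volume used by both containers.
        - name: flink-conf
          emptyDir: {}
      initContainers:
        - name: set-taskmanager-host
          image: busybox:1.36
          env:
            # Downward API: expose this pod's IP to the init container.
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          command:
            - sh
            - -c
            - |
              # Copy the base config, then append the pod IP as taskmanager.host.
              cp /flink-conf-base/* /etc/flink/
              printf '\ntaskmanager.host: %s\n' "${POD_IP}" >> /etc/flink/flink-conf.yaml
          volumeMounts:
            - name: flink-conf-base
              mountPath: /flink-conf-base
            - name: flink-conf
              mountPath: /etc/flink
      containers:
        - name: taskmanager
          # Placeholder image; assumed to load its config from /etc/flink.
          image: my-registry/flink:1.7
          ports:
            # Fixed metrics query service port from the configuration.
            - name: metrics-query
              containerPort: 6666
          volumeMounts:
            - name: flink-conf
              mountPath: /etc/flink
```

The leading newline in the `printf` guards against a base flink-conf.yaml that does not end with a line break, so the appended `taskmanager.host` entry always starts on its own line.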