[
https://issues.apache.org/jira/browse/FLINK-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Metzger updated FLINK-14328:
-----------------------------------
Component/s: Deployment / Kubernetes
> JobCluster cannot reach TaskManager in K8s
> ------------------------------------------
>
> Key: FLINK-14328
> URL: https://issues.apache.org/jira/browse/FLINK-14328
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Reporter: Tim
> Priority: Major
> Fix For: 1.9.2
>
>
> I have a Job Cluster which I am running in K8s. It consists of
> * job manager deployment (1)
> * task manager deployment (1)
> * service
> This is more or less following the standard "Job Cluster" setup.
> Additionally, due to known issues with TMs talking to JMs, I have set
> taskmanager.network.bind-policy to "ip", so that the task manager binds on
> the IP of the pod rather than the pod name (which is not reachable via DNS).
> So far so good.
>
> Once the cluster is started, I can see the job running. I also see that the
> JM's resource manager has registered the TM.
> {code:java}
> 2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Slot Pool Status:
> status: connected to
> akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
> registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
> available slots: []
> allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @
> f34656491b8dfae726d992d276dc6d39 @
> data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815)
> - 0]]
> pending requests: []
> }
> {code}
> However, I see several errors like the one below before the job eventually
> fails (after roughly 5 minutes) and goes into recovery. This repeats until all
> restarts are exhausted, at which point the cluster fails completely.
> {code:java}
> 2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN
> akka.remote.ReliableDeliverySupervisor
> flink-akka.remote.default-remote-dispatcher-6 - Association with remote
> system [akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated
> for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@10.107.38.92:50100]] Caused by: [java.net.ConnectException:
> Connection refused: /10.107.38.92:50100]
> {code}
> To me it looks like the JM is not able to make a connection on the RPC port
> of the taskmanager (50100 is the taskmanager.rpc.port setting, and
> 10.107.38.92 is the IP address of the task manager pod as seen by "kubectl
> describe pod").
> Has anyone come across this issue?
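
One way to narrow down a "Connection refused" like the one above is a plain TCP probe against the TaskManager's RPC address, run from inside the JobManager pod. The following is a minimal sketch, not part of Flink: `can_connect` is a hypothetical helper name, and it assumes a Python interpreter is available in the container, which may not be the case for every Flink image.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only; it checks TCP reachability
    and knows nothing about Akka or Flink RPC.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable or
        # unresolvable hosts.
        return False
```

Calling `can_connect("10.107.38.92", 50100)` from the JobManager pod would probably return `False` in the situation described, matching the `java.net.ConnectException`. A `False` result only shows that nothing accepted the connection at that address and port; it does not say whether the TaskManager is bound to a different interface or a different port.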
--
This message was sent by Atlassian Jira
(v8.3.4#803005)