[jira] [Updated] (FLINK-14328) JobCluster cannot reach TaskManager in K8s

Tim (Jira) Sat, 05 Oct 2019 13:54:09 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim updated FLINK-14328:
------------------------
    Description: 
I have a Job Cluster which I am running in K8s.  It consists of
 * job manager deployment (1)
 * task manager deployment (1)
 * service

This is more or less the standard "Job Cluster" setup.   
Additionally, due to known issues with TMs talking to JMs, I have set 
taskmanager.network.bind-policy to "ip", so that the task manager binds to the 
IP of the pod rather than the pod name (which cannot be resolved via DNS).   So 
far so good.
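
The DNS point above can be checked from inside the job manager pod. This is a minimal sketch, assuming a Python interpreter is available in the container (the Flink image may not ship one); `resolve_host` is a hypothetical helper name, not part of Flink.

```python
import socket
from typing import Optional


def resolve_host(name: str) -> Optional[str]:
    """Return the IPv4 address that `name` resolves to, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        # The resolver could not map the name to an address.
        return None
```

Calling `resolve_host` with the task manager pod name should return None if the pod name really is not resolvable, which is the situation that the "ip" bind policy is meant to work around.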

 

Once the cluster is started, I can see the job running.  I also see that the 
JM's resource manager has registered the TM.
{code:java}
2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG 
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Slot Pool Status:
        status: connected to 
akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
        registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
        available slots: []
        allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @ 
f34656491b8dfae726d992d276dc6d39 @ 
data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815) - 
0]]
        pending requests: []
        }
{code}
However, I see several errors like the one below before the job eventually fails 
(after roughly 5 minutes) and goes into recovery.   This repeats until all 
restarts are exhausted, at which point the cluster fails completely.
{code:java}
2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-6 - Association with remote system 
[akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.107.38.92:50100]] 
Caused by: [java.net.ConnectException: Connection refused: /10.107.38.92:50100]
{code}
{{To me it looks like the JM is not able to make a connection on the RPC port 
of the taskmanager (50100 is the taskmanager.rpc.port setting, and 10.107.38.92 
is the IP address of the task manager pod as seen by "kubectl describe pod".)}}
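
One way to narrow this down is to probe the RPC port directly from the job manager pod. The following is a rough sketch, assuming Python is available in the container; `probe_tcp` is a hypothetical helper, not a Flink tool.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this address/port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For the case above the call would be `probe_tcp("10.107.38.92", 50100)`. A "refused" result would match the log and suggests that the pod is reachable but nothing is listening on that address and port, whereas a "timeout" would point more towards routing or a network policy.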

{{Has anyone come across this issue?}}

  was:
I have a Job Cluster which I am running in K8s.  It consists of
 * job manager deployment (1)
 * task manager deployment (1)
 * service

This is more or less the standard "Job Cluster" setup.   
Additionally, due to known issues with TMs talking to JMs, I have set 
taskmanager.network.bind-policy to "ip", so that the task manager binds to the 
IP of the pod rather than the pod name (which cannot be resolved via DNS).   So 
far so good.

 

Once the cluster is started, I can see the job running.  I also see that the 
JM's resource manager has registered the TM.
{code:java}
2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG 
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Slot Pool Status:
        status: connected to 
akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
        registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
        available slots: []
        allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @ 
f34656491b8dfae726d992d276dc6d39 @ 
data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815) - 
0]]
        pending requests: []
        }
{code}
However, I see several errors like the one below before the job eventually fails 
(after roughly 5 minutes) and goes into recovery.   This repeats until all 
restarts are exhausted, at which point the cluster fails completely.
{code:java}
2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-6 - Association with remote system 
[akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.107.38.92:50100]] 
Caused by: [java.net.ConnectException: Connection refused: /10.107.38.92:50100]
{code}
{{To me it looks like the JM is not able to make a connection on the RPC port 
of the taskmanager (50100 is the taskmanager.rpc.port setting).}}

{{Has anyone come across this issue?}}


> JobCluster cannot reach TaskManager in K8s
> ------------------------------------------
>
>                 Key: FLINK-14328
>                 URL: https://issues.apache.org/jira/browse/FLINK-14328
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Tim
>            Priority: Major
>             Fix For: 1.9.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
