Joshua Griffith created FLINK-7340:
--------------------------------------

             Summary: Taskmanager hung after temporary DNS outage
                 Key: FLINK-7340
                 URL: https://issues.apache.org/jira/browse/FLINK-7340
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.3.1
         Environment: Non-HA Flink running in Kubernetes.
            Reporter: Joshua Griffith


After a Kubernetes node failure, several TaskManagers and the DNS system were 
automatically restarted. One TaskManager was unable to connect to the 
JobManager and continually logged the following errors:

{quote}
2017-08-01 18:58:06.707 [flink-akka.actor.default-dispatcher-823] INFO  
org.apache.flink.runtime.taskmanager.TaskManager  - Trying to register at 
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 595, 
timeout: 30000 milliseconds)
2017-08-01 18:58:06.713 [flink-akka.actor.default-dispatcher-834] INFO  
Remoting flink-akka.remote.default-remote-dispatcher-240 - Quarantined address 
[akka.tcp://flink@jobmanager:6123] is still unreachable or has not been 
restarted. Keeping it quarantined.
{quote}

After exec'ing into the container, I was able to {{telnet jobmanager 6123}} 
successfully and {{dig jobmanager}} showed the correct IP in DNS. I suspect 
that the TaskManager cached a bad IP address for the JobManager when the DNS 
system was restarting and it used that cached address rather than respecting 
the 30s TTL and getting a new one for the next request. It may be a good idea 
for the TaskManager to explicitly perform a DNS lookup after JobManager 
connection failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to