Scott Kidder created FLINK-7021:
-----------------------------------

             Summary: Flink Task Manager hangs on startup if one Zookeeper node is unresolvable
                 Key: FLINK-7021
                 URL: https://issues.apache.org/jira/browse/FLINK-7021
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.2.1, 1.3.0, 1.2.0
         Environment: Kubernetes cluster running:
* Flink 1.3.0 Job Manager & Task Manager on Java 8u131
* Zookeeper 3.4.10 cluster with 3 nodes
            Reporter: Scott Kidder


h2. Problem
Flink Task Manager hangs during startup if any one of the ZooKeeper nodes in 
the ZooKeeper connection string is unresolvable.
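
The failing lookup can be checked outside Flink. The second stack trace in the logs below shows java.net.UnknownHostException raised from InetAddress.getAllByName while the ZooKeeper client is being constructed. The following is a minimal Java sketch of a diagnostic that reports which hosts in a connection string do not resolve. It assumes plain host:port entries with no chroot suffix and no IPv6 literals; ConnectStringCheck, Lookup and unresolvableHosts are hypothetical names, not Flink or ZooKeeper APIs.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

final class ConnectStringCheck {

    /** Hypothetical lookup hook so the DNS call can be replaced in tests. */
    @FunctionalInterface
    public interface Lookup {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /**
     * Returns the host names from a "host:port,host:port" string that fail to resolve.
     * Assumption: no chroot suffix and no IPv6 literals in the string.
     */
    public static List<String> unresolvableHosts(String connectString, Lookup lookup) {
        List<String> failed = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Strip the ":port" part if present.
            int colon = trimmed.lastIndexOf(':');
            String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
            try {
                lookup.resolve(host);
            } catch (UnknownHostException e) {
                failed.add(host);
            }
        }
        return failed;
    }
}
```

Passing InetAddress::getAllByName as the lookup performs real DNS resolution. With the connection string from this report, a non-empty result would name the node or nodes that cannot currently be resolved.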

h2. Expected Behavior
Flink should retry both name resolution and connection to the ZooKeeper nodes, 
using exponential back-off.
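
One possible shape for such a retry, as a sketch only and not how Flink or Curator implement it: resolve the host, and on UnknownHostException wait for a delay that doubles on each attempt up to a cap. BackoffResolver, Lookup, delayForAttempt and the attempt and delay parameters are hypothetical names chosen for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class BackoffResolver {

    /** Hypothetical lookup hook so the DNS call can be replaced in tests. */
    @FunctionalInterface
    public interface Lookup {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /**
     * Delay before the retry that follows failed attempt number `attempt` (0-based):
     * baseMs * 2^attempt, capped at maxMs. Assumes baseMs and maxMs are positive.
     */
    public static long delayForAttempt(int attempt, long baseMs, long maxMs) {
        if (attempt >= 62) {
            return maxMs; // 1L << attempt would overflow
        }
        long factor = 1L << attempt;
        if (baseMs > maxMs / factor) {
            return maxMs; // product would exceed the cap (also guards overflow)
        }
        return baseMs * factor;
    }

    /** Tries the lookup up to maxAttempts times, sleeping between failed attempts. */
    public static InetAddress[] resolveWithBackoff(
            String host, Lookup lookup, int maxAttempts, long baseMs, long maxMs)
            throws UnknownHostException, InterruptedException {
        UnknownHostException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return lookup.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                // Do not sleep after the final attempt.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayForAttempt(attempt, baseMs, maxMs));
                }
            }
        }
        throw last != null ? last : new UnknownHostException(host);
    }
}
```

For example, resolveWithBackoff("zookeeper-1.zookeeper", InetAddress::getAllByName, 5, 100, 5000) would wait 100, 200, 400 and 800 ms between its five attempts before rethrowing the last UnknownHostException.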

h2. Environment Details
We're running Flink and ZooKeeper in Kubernetes on CoreOS. CoreOS can be 
configured to detect and apply operating system updates automatically. 
One of our ZooKeeper nodes runs on the same CoreOS instance as Flink, so that 
ZooKeeper node may not yet have started when the Flink components start. 
This can cause hostname resolution of the ZooKeeper nodes to fail.

h3. Flink Task Manager Logs
{noformat}
2017-06-27 15:38:51,713 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using configured hostname/address for TaskManager: 10.2.45.11
2017-06-27 15:38:51,714 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager
2017-06-27 15:38:51,714 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at 10.2.45.11:6122.
2017-06-27 15:38:52,950 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2017-06-27 15:38:53,079 INFO  Remoting                                                      - Starting remoting
2017-06-27 15:38:53,573 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@10.2.45.11:6122]
2017-06-27 15:38:53,576 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
2017-06-27 15:38:53,660 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: /10.2.45.11, server port: 6121, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2017-06-27 15:38:53,682 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2017-06-27 15:38:53,688 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerServices     - Temporary file directory '/tmp': total 49 GB, usable 42 GB (85.71% usable)
2017-06-27 15:38:54,071 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 96 MB for network buffer pool (number of memory segments: 3095, bytes per segment: 32768).
2017-06-27 15:38:54,564 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment        - Starting the network environment and its components.
2017-06-27 15:38:54,576 INFO  org.apache.flink.runtime.io.network.netty.NettyClient         - Successful initialization (took 4 ms).
2017-06-27 15:38:54,677 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful initialization (took 101 ms). Listening on SocketAddress /10.2.45.11:6121.
2017-06-27 15:38:54,981 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerServices     - Limiting managed memory to 0.7 of the currently free heap space (612 MB), memory will be allocated lazily.
2017-06-27 15:38:55,050 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-ca01554d-f25e-4c17-a828-96d82b43d4a7 for spill files.
2017-06-27 15:38:55,061 INFO  org.apache.flink.runtime.metrics.MetricRegistry               - Configuring StatsDReporter with {interval=10 SECONDS, port=8125, host=localhost, class=org.apache.flink.metrics.statsd.StatsDReporter}.
2017-06-27 15:38:55,065 INFO  org.apache.flink.metrics.statsd.StatsDReporter                - Configured StatsDReporter with {host:localhost, port:8125}
2017-06-27 15:38:55,065 INFO  org.apache.flink.runtime.metrics.MetricRegistry               - Periodically reporting metrics in intervals of 10 SECONDS for reporter statsd of type org.apache.flink.metrics.statsd.StatsDReporter.
2017-06-27 15:38:55,175 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-e4c5bcc5-7513-40d9-a665-0d33c80a36ba
2017-06-27 15:38:55,187 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-310ba2f8-f96a-4c3f-b1db-35ac26b83f7e
2017-06-27 15:38:55,273 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#207081801.
2017-06-27 15:38:55,273 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: 7f86855dac2af4cca9eb2ae4c046630e @ flink-taskmanager-3116622558-sqggc (dataPort=6121)
2017-06-27 15:38:55,273 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 2 task slot(s).
2017-06-27 15:38:55,276 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 124/981/981 MB, NON HEAP: 43/44/-1 MB (used/committed/max)]
2017-06-27 15:38:55,276 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
2017-06-27 15:39:10,289 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (18617)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
        at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
        at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
        at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100)
        at org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205)
        at akka.actor.Actor$class.aroundPreStart(Actor.scala:472)
        at org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120)
        at akka.actor.ActorCell.create(ActorCell.scala:580)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-06-27 15:39:30,349 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af
2017-06-27 15:40:00,388 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 68719 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
2017-06-27 15:40:00,388 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af
2017-06-27 15:40:00,450 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Ensure path threw exception
java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
        at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
        at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
        at org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128)
        at org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
        at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261)
        at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221)
        at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
        at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
        at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)
        at org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
        at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
        at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100)
        at org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205)
        at akka.actor.Actor$class.aroundPreStart(Actor.scala:472)
        at org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120)
        at akka.actor.ActorCell.create(ActorCell.scala:580)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
