[
https://issues.apache.org/jira/browse/FLINK-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann resolved FLINK-7021.
----------------------------------
Resolution: Fixed
Fixed via 45aceb4d8d0e2f4b8af2fc04c0ff403e7fc001b6
> Flink Task Manager hangs on startup if one Zookeeper node is unresolvable
> -------------------------------------------------------------------------
>
> Key: FLINK-7021
> URL: https://issues.apache.org/jira/browse/FLINK-7021
> Project: Flink
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.2.0, 1.3.0, 1.2.1, 1.3.1
> Environment: Kubernetes cluster running:
> * Flink 1.3.0 Job Manager & Task Manager on Java 8u131
> * Zookeeper 3.4.10 cluster with 3 nodes
> Reporter: Scott Kidder
> Assignee: Scott Kidder
> Priority: Blocker
> Fix For: 1.4.0
>
>
> h2. Problem
> Flink Task Manager will hang during startup if one of the Zookeeper nodes in
> the Zookeeper connection string is unresolvable.
> h2. Expected Behavior
> Flink should retry both name resolution and connection to the Zookeeper
> nodes, using exponential back-off.
> h2. Environment Details
> We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can be
> configured to detect and apply operating system updates automatically. One of
> our Zookeeper nodes runs on the same CoreOS instance as Flink, so that
> Zookeeper node may not yet be running when the Flink components start. In
> that case, hostname resolution for the Zookeeper node fails.
> h3. Flink Task Manager Logs
> {noformat}
> 2017-06-27 15:38:51,713 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Using
> configured hostname/address for TaskManager: 10.2.45.11
> 2017-06-27 15:38:51,714 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Starting
> TaskManager
> 2017-06-27 15:38:51,714 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Starting
> TaskManager actor system at 10.2.45.11:6122.
> 2017-06-27 15:38:52,950 INFO akka.event.slf4j.Slf4jLogger
> - Slf4jLogger started
> 2017-06-27 15:38:53,079 INFO Remoting
> - Starting remoting
> 2017-06-27 15:38:53,573 INFO Remoting
> - Remoting started; listening on addresses
> :[akka.tcp://flink@10.2.45.11:6122]
> 2017-06-27 15:38:53,576 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Starting
> TaskManager actor
> 2017-06-27 15:38:53,660 INFO
> org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig
> [server address: /10.2.45.11, server port: 6121, ssl enabled: false, memory
> segment size (bytes): 32768, transport type: NIO, number of server threads: 2
> (manual), number of client threads: 2 (manual), server connect backlog: 0
> (use Netty's default), client connect timeout (sec): 120, send/receive buffer
> size (bytes): 0 (use Netty's default)]
> 2017-06-27 15:38:53,682 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages
> have a max timeout of 10000 ms
> 2017-06-27 15:38:53,688 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary
> file directory '/tmp': total 49 GB, usable 42 GB (85.71% usable)
> 2017-06-27 15:38:54,071 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 96
> MB for network buffer pool (number of memory segments: 3095, bytes per
> segment: 32768).
> 2017-06-27 15:38:54,564 INFO
> org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the
> network environment and its components.
> 2017-06-27 15:38:54,576 INFO
> org.apache.flink.runtime.io.network.netty.NettyClient - Successful
> initialization (took 4 ms).
> 2017-06-27 15:38:54,677 INFO
> org.apache.flink.runtime.io.network.netty.NettyServer - Successful
> initialization (took 101 ms). Listening on SocketAddress /10.2.45.11:6121.
> 2017-06-27 15:38:54,981 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting
> managed memory to 0.7 of the currently free heap space (612 MB), memory will
> be allocated lazily.
> 2017-06-27 15:38:55,050 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager
> uses directory /tmp/flink-io-ca01554d-f25e-4c17-a828-96d82b43d4a7 for spill
> files.
> 2017-06-27 15:38:55,061 INFO org.apache.flink.runtime.metrics.MetricRegistry
> - Configuring StatsDReporter with {interval=10 SECONDS,
> port=8125, host=localhost,
> class=org.apache.flink.metrics.statsd.StatsDReporter}.
> 2017-06-27 15:38:55,065 INFO org.apache.flink.metrics.statsd.StatsDReporter
> - Configured StatsDReporter with {host:localhost, port:8125}
> 2017-06-27 15:38:55,065 INFO org.apache.flink.runtime.metrics.MetricRegistry
> - Periodically reporting metrics in intervals of 10 SECONDS for
> reporter statsd of type org.apache.flink.metrics.statsd.StatsDReporter.
> 2017-06-27 15:38:55,175 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-e4c5bcc5-7513-40d9-a665-0d33c80a36ba
> 2017-06-27 15:38:55,187 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-310ba2f8-f96a-4c3f-b1db-35ac26b83f7e
> 2017-06-27 15:38:55,273 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Starting
> TaskManager actor at akka://flink/user/taskmanager#207081801.
> 2017-06-27 15:38:55,273 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager
> data connection information: 7f86855dac2af4cca9eb2ae4c046630e @
> flink-taskmanager-3116622558-sqggc (dataPort=6121)
> 2017-06-27 15:38:55,273 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager
> has 2 task slot(s).
> 2017-06-27 15:38:55,276 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory usage
> stats: [HEAP: 124/981/981 MB, NON HEAP: 43/44/-1 MB (used/committed/max)]
> 2017-06-27 15:38:55,276 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService.
> 2017-06-27 15:39:10,289 ERROR
> org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection
> timed out for connection string
> (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181)
> and timeout (15000) / elapsed (18617)
> org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException:
> KeeperErrorCode = ConnectionLoss
> at
> org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
> at
> org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
> at
> org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
> at
> org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
> at
> org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
> at
> org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
> at
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100)
> at
> org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205)
> at akka.actor.Actor$class.aroundPreStart(Actor.scala:472)
> at
> org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120)
> at akka.actor.ActorCell.create(ActorCell.scala:580)
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2017-06-27 15:39:30,349 INFO org.apache.zookeeper.ZooKeeper
> - Initiating client connection,
> connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
> sessionTimeout=60000
> watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af
> 2017-06-27 15:40:00,388 WARN
> org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection
> attempt unsuccessful after 68719 (greater than max timeout of 60000).
> Resetting connection and trying again with a new connection.
> 2017-06-27 15:40:00,388 INFO org.apache.zookeeper.ZooKeeper
> - Initiating client connection,
> connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
> sessionTimeout=60000
> watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af
> 2017-06-27 15:40:00,450 ERROR
> org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl
> - Ensure path threw exception
> java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not
> known
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
> at
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at
> org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> at
> org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
> at
> org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
> at
> org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
> at
> org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128)
> at
> org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
> at
> org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261)
> at
> org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221)
> at
> org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
> at
> org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
> at
> org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
> at
> org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
> at
> org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
> at
> org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)
> at
> org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)
> at
> org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
> at
> org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
> at
> org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
> at
> org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
> at
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100)
> at
> org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205)
> at akka.actor.Actor$class.aroundPreStart(Actor.scala:472)
> at
> org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120)
> at akka.actor.ActorCell.create(ActorCell.scala:580)
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> {noformat}
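
The behavior requested under "Expected Behavior" could look roughly like the sketch below. It is not the change made in commit 45aceb4d8d0e2f4b8af2fc04c0ff403e7fc001b6 and does not use Flink or Curator APIs. The class `ZkResolveBackoff`, the `Resolver` hook, and all parameter names are hypothetical and exist only for illustration. It assumes that every host in the connection string must resolve before a connection is attempted, and that the delay doubles after each failed round up to a cap.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink, Curator, or ZooKeeper.
final class ZkResolveBackoff {

    /** Hypothetical hook so the DNS lookup can be replaced, e.g. in tests. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /** Delay before the retry that follows 0-based attempt: baseMs * 2^attempt, capped at maxMs. */
    static long backoffMillis(int attempt, long baseMs, long maxMs) {
        // Bound the shift so modest base values cannot overflow a long.
        long delay = baseMs << Math.min(attempt, 20);
        return Math.min(delay, maxMs);
    }

    /** Extracts the host names from a "host:port,host:port" connection string. */
    static List<String> hosts(String connectString) {
        List<String> result = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            int colon = trimmed.lastIndexOf(':');
            result.add(colon >= 0 ? trimmed.substring(0, colon) : trimmed);
        }
        return result;
    }

    /**
     * Resolves every host in the connection string, retrying the whole round
     * with exponential back-off. Returns the number of attempts used and
     * rethrows the last UnknownHostException once maxAttempts is exhausted.
     */
    static int resolveAllWithBackoff(String connectString, Resolver resolver,
                                     int maxAttempts, long baseMs, long maxMs)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        UnknownHostException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                for (String host : hosts(connectString)) {
                    resolver.resolve(host);
                }
                return attempt + 1;
            } catch (UnknownHostException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMs, maxMs));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Print the delay schedule for a 1 s base and a 30 s cap.
        for (int attempt = 0; attempt < 7; attempt++) {
            System.out.println("retry " + attempt + ": " + backoffMillis(attempt, 1000, 30000) + " ms");
        }
    }
}
```

With the standard JDK lookup, a call might read `ZkResolveBackoff.resolveAllWithBackoff(connectString, InetAddress::getAllByName, 10, 1000, 30000)`. Capping the delay keeps a long outage from pushing the wait between rounds arbitrarily high.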