[ https://issues.apache.org/jira/browse/FLINK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann closed FLINK-7022.
--------------------------------
    Resolution: Not A Problem

No longer a problem.

> Flink Job Manager Scheduler & Web Frontend out of sync when Zookeeper is
> unavailable on startup
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-7022
> URL: https://issues.apache.org/jira/browse/FLINK-7022
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Environment: Kubernetes cluster running:
> * Flink 1.3.0 Job Manager & Task Manager on Java 8u131
> * Zookeeper 3.4.10 cluster with 3 nodes
> Reporter: Scott Kidder
> Priority: Major
>
> h2. Problem
> Flink Job Manager web frontend is permanently unavailable if one or more
> Zookeeper nodes are unresolvable during startup. The job scheduler eventually
> recovers and assigns jobs to task managers, but the web frontend continues to
> respond with an HTTP 503 and the following message:
> {noformat}Service temporarily unavailable due to an ongoing leader election.
> Please refresh.{noformat}
> h2. Expected Behavior
> Once Flink is able to interact with Zookeeper successfully, all aspects of
> the Job Manager (job scheduling & the web frontend) should be available.
> h2. Environment Details
> We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can run in
> a configuration that automatically detects and applies operating system
> updates. We have a Zookeeper node running on the same CoreOS instance as
> Flink. It's possible that the Zookeeper node will not yet be started when the
> Flink components are started. This could cause hostname resolution of the
> Zookeeper nodes to fail.
> h3. Flink Task Manager Logs
> {noformat}
> 2017-06-27 15:38:47,161 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.host, localhost
> 2017-06-27 15:38:47,161 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.port, 8125
> 2017-06-27 15:38:47,162 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.interval, 10 SECONDS
> 2017-06-27 15:38:47,254 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem
> 2017-06-27 15:38:47,254 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, hdfs://hdfs:8020/flink/checkpoints
> 2017-06-27 15:38:47,255 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, hdfs://hdfs:8020/flink/savepoints
> 2017-06-27 15:38:47,255 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.mode, zookeeper
> 2017-06-27 15:38:47,256 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.quorum, zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
> 2017-06-27 15:38:47,256 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.storageDir, hdfs://hdfs:8020/flink/recovery
> 2017-06-27 15:38:47,256 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.jobmanager.port, 6123
> 2017-06-27 15:38:47,257 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 41479
> 2017-06-27 15:38:47,357 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.mode' instead of proper key 'high-availability'
> 2017-06-27 15:38:47,366 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager with high-availability
> 2017-06-27 15:38:47,366 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.jobmanager.port' instead of proper key 'high-availability.jobmanager.port'
> 2017-06-27 15:38:47,452 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on flink:6123 with execution mode CLUSTER
> 2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, flink
> 2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
> 2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
> 2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
> 2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
> 2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
> 2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
> 2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
> 2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporters, statsd
> 2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.class, org.apache.flink.metrics.statsd.StatsDReporter
> 2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.host, localhost
> 2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.port, 8125
> 2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.interval, 10 SECONDS
> 2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem
> 2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, hdfs://hdfs:8020/flink/checkpoints
> 2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, hdfs://hdfs:8020/flink/savepoints
> 2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.mode, zookeeper
> 2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.quorum, zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
> 2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.storageDir, hdfs://hdfs:8020/flink/recovery
> 2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.jobmanager.port, 6123
> 2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 41479
> 2017-06-27 15:38:48,055 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to root (auth:SIMPLE)
> 2017-06-27 15:38:48,664 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at flink:6123
> 2017-06-27 15:38:50,955 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
> 2017-06-27 15:38:51,252 INFO Remoting - Starting remoting
> 2017-06-27 15:38:52,679 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@flink:6123]
> 2017-06-27 15:38:52,758 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.mode' instead of proper key 'high-availability'
> 2017-06-27 15:38:52,761 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.mode' instead of proper key 'high-availability'
> 2017-06-27 15:38:52,764 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.zookeeper.storageDir' instead of proper key 'high-availability.storageDir'
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, flink
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
> 2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
> 2017-06-27 15:38:52,864 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporters, statsd
> 2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.class, org.apache.flink.metrics.statsd.StatsDReporter
> 2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.host, localhost
> 2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.port, 8125
> 2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.interval, 10 SECONDS
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>     at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
>     at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
>     at org.apache.flink.shaded.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:262)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.start(ConnectionState.java:109)
>     at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:191)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
>     at org.apache.flink.runtime.util.ZooKeeperUtils.startCuratorFramework(ZooKeeperUtils.java:128)
>     at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:96)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2047)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at scala.util.Try$.apply(Try.scala:192)
>     at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
>     at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> 2017-06-27 15:38:59,160 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
> 2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager log file: /usr/local/flink-1.3.0/log/flink--jobmanager-0-flink-jobmanager-3380372638-1q7jb.log
> 2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager stdout file: /usr/local/flink-1.3.0/log/flink--jobmanager-0-flink-jobmanager-3380372638-1q7jb.out
> 2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-252afcf4-d41d-4095-a082-f6ce5176c2f5 for the web interface files
> 2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-2ca2cadf-a1b6-44af-9510-9c523a422022 for web frontend JAR file uploads
> 2017-06-27 15:39:01,060 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081
> 2017-06-27 15:39:01,060 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
> 2017-06-27 15:39:01,253 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-1f49aadd-0a7d-45d1-8fdc-fc2167ca93d5
> 2017-06-27 15:39:01,257 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:41479 - max concurrent requests: 50 - max backlog: 1000
> 2017-06-27 15:39:01,851 INFO org.apache.flink.runtime.metrics.MetricRegistry - Configuring StatsDReporter with {interval=10 SECONDS, port=8125, host=localhost, class=org.apache.flink.metrics.statsd.StatsDReporter}.
> 2017-06-27 15:39:01,948 INFO org.apache.flink.metrics.statsd.StatsDReporter - Configured StatsDReporter with {host:localhost, port:8125}
> 2017-06-27 15:39:01,949 INFO org.apache.flink.runtime.metrics.MetricRegistry - Periodically reporting metrics in intervals of 10 SECONDS for reporter statsd of type org.apache.flink.metrics.statsd.StatsDReporter.
> 2017-06-27 15:39:02,050 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
> 2017-06-27 15:39:02,059 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.zookeeper.storageDir' instead of proper key 'high-availability.storageDir'
> 2017-06-27 15:39:17,252 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (18395)
> org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
>     at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
>     at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.newNamespaceAwareEnsurePath(NamespaceImpl.java:109)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.newNamespaceAwareEnsurePath(CuratorFrameworkImpl.java:469)
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:116)
>     at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
>     at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
>     at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
>     at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> 2017-06-27 15:39:37,448 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
> 2017-06-27 15:40:07,457 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 68603 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
> 2017-06-27 15:40:07,457 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
> 2017-06-27 15:40:07,555 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Ensure path threw exception
> java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not known
>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>     at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
>     at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
>     at org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128)
>     at org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
>     at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
>     at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.newNamespaceAwareEnsurePath(NamespaceImpl.java:109)
>     at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.newNamespaceAwareEnsurePath(CuratorFrameworkImpl.java:469)
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:116)
>     at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
>     at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
>     at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at scala.util.Try$.apply(Try.scala:192)
>     at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
>     at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> 2017-06-27 15:40:22,566 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (15108)
> org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
>     at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
>     at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
>     at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at scala.util.Try$.apply(Try.scala:192)
>     at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
>     at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> 2017-06-27 15:40:42,575 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
> 2017-06-27 15:41:02,684 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (55226)
> org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
>     at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
>     at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
>     at org.apache.flink.shaded.org.apache.curator.utils.EnsurePath$InitialHelper$1.call(EnsurePath.java:156)
>     at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>     at org.apache.flink.shaded.org.apache.curator.utils.EnsurePath$InitialHelper.ensure(EnsurePath.java:149)
>     at org.apache.flink.shaded.org.apache.curator.utils.EnsurePath.ensure(EnsurePath.java:102)
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:117)
>     at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
>     at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
>     at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
>     at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
>     at scala.util.Try$.apply(Try.scala:192)
>     at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
>     at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
>     at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> 2017-06-27 15:41:02,684 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
> 2017-06-27 15:41:02,803 WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-1381454376626202001.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2017-06-27 15:41:02,804 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Authentication failed
> 2017-06-27 15:41:02,806 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-2-8-5.ec2.internal/10.2.8.5:2181
> ...
> 2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job (022d8149808dd3297a8a7275a1fd3d6b) if no longer possible.
> 2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job (022d8149808dd3297a8a7275a1fd3d6b) switched from state FAILING to RESTARTING.
> 2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job (022d8149808dd3297a8a7275a1fd3d6b).
> 2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestarter - Delaying retry of job execution for 10000 ms ...
> 2017-06-27 16:00:58,252 INFO org.apache.flink.runtime.jobmanager.JobManager - Task Manager Registration but not connected to ResourceManager
> 2017-06-27 16:00:58,254 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-3116622558-zmmwq (akka.tcp://flink@10.2.8.11:6122/user/taskmanager) as 2a058f00bd1e25f44c1cb8f3e5dd726f. Current number of registered hosts is 1. Current number of alive task slots is 2.
> 2017-06-27 16:00:58,453 INFO org.apache.flink.runtime.jobmanager.JobManager - Task Manager Registration but not connected to ResourceManager
> 2017-06-27 16:01:01,491 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job (022d8149808dd3297a8a7275a1fd3d6b) switched from state RESTARTING to CREATED.
> 2017-06-27 16:01:01,491 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
> 2017-06-27 16:01:01,645 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
> 2017-06-27 16:01:01,645 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 502.
> 2017-06-27 16:01:01,660 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from latest valid checkpoint: Checkpoint 502 @ 1498577858587 for 022d8149808dd3297a8a7275a1fd3d6b.
> 2017-06-27 16:01:01,661 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to restore
> 2017-06-27 16:01:01,661 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job (022d8149808dd3297a8a7275a1fd3d6b) switched from state CREATED to RUNNING.
> {noformat}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
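
The Environment Details section of the issue attributes the failure to ZooKeeper hostnames that cannot yet be resolved when the Job Manager starts. One possible mitigation, sketched here in Python under stated assumptions, is a pre-start check that blocks until every host in the quorum string resolves. This is not part of Flink and does not address the web frontend behavior itself; the function names (parse_quorum, resolves, wait_for_quorum), the default port 2181 for entries without a port, and the timeout values are all hypothetical choices made for illustration.

```python
import socket
import time
from typing import Callable, List, Tuple


def parse_quorum(quorum: str) -> List[Tuple[str, int]]:
    """Split a 'host:port,host:port' quorum string into (host, port) pairs."""
    pairs = []
    for entry in quorum.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if ":" in entry:
            host, _, port = entry.rpartition(":")
            pairs.append((host, int(port)))
        else:
            # Assumption: fall back to the conventional ZooKeeper client port.
            pairs.append((entry, 2181))
    return pairs


def resolves(host: str, port: int) -> bool:
    """Return True if the hostname can currently be resolved."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


def wait_for_quorum(
    quorum: str,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    resolver: Callable[[str, int], bool] = resolves,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every quorum host resolves; return False on timeout."""
    pending = parse_quorum(quorum)
    deadline = clock() + timeout_s
    while True:
        # Keep only the hosts that still fail to resolve.
        pending = [(h, p) for h, p in pending if not resolver(h, p)]
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A container entrypoint could call wait_for_quorum with the value of recovery.zookeeper.quorum from the log above and start the Job Manager only when it returns True. The resolver, sleep, and clock parameters exist so the loop can be exercised without real DNS lookups.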