[ https://issues.apache.org/jira/browse/FLINK-6063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927786#comment-15927786 ]
Razvan edited comment on FLINK-6063 at 3/16/17 10:34 AM:
---------------------------------------------------------

Hi Till, thanks for replying. Sure, I can attach the logs you mentioned.

Started ZooKeeper, then the cluster, @~9:58
Killed the leader JobManager @~10:10
TaskManagers stop retrying @~10:14

Job log:
Cluster configuration: Standalone cluster with JobManager at /1.2.3.4:45164
Using address 1.2.3.4:45164 to connect to JobManager.
JobManager web interface address http://1.2.3.4:8081
Starting execution of program
Submitting job with JobID: a7c96ad4345c1f07fe666bc5fd78256f. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@1.2.3.4:45164/user/jobmanager#-1418996734]
03/16/2017 09:58:23 Job execution switched to status RUNNING.
03/16/2017 09:58:23 Source: Custom Source -> Flat Map(1/1) switched to SCHEDULED
03/16/2017 09:58:23 Source: Custom Source -> Flat Map(1/1) switched to DEPLOYING
03/16/2017 09:58:23 Flat Map(1/1) switched to SCHEDULED
03/16/2017 09:58:23 Flat Map(1/1) switched to DEPLOYING
03/16/2017 09:58:23 Flat Map(1/1) switched to RUNNING
03/16/2017 09:58:23 Source: Custom Source -> Flat Map(1/1) switched to RUNNING
New JobManager elected.
Connecting to null
Connected to JobManager at Actor[akka.tcp://flink@1.2.3.5:34987/user/jobmanager#-27372488]

Killed JobManager log:
2017-03-16 09:58:14,953 INFO org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /[Client 1 IP here]:40858
2017-03-16 09:58:14,953 INFO org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /[Client 1 IP here]:40858
2017-03-16 09:58:14,957 INFO org.apache.zookeeper.server.ZooKeeperServer - Established session 0x35ad68d8b4d0004 with negotiated timeout 40000 for client /[Client 1 IP here]:40858
2017-03-16 09:58:15,523 INFO org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /[Client 2 IP here]:40276
2017-03-16 09:58:15,528 INFO org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /[Client 2 IP here]:40276
2017-03-16 09:58:15,531 INFO org.apache.zookeeper.server.ZooKeeperServer - Established session 0x35ad68d8b4d0005 with negotiated timeout 40000 for client /[Client 2 IP here]:40276
2017-03-16 10:10:25,118 WARN org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x35ad68d8b4d0002, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2017-03-16 10:10:25,120 INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.4:47872 which had sessionid 0x35ad68d8b4d0002

New Leader log:
2017-03-16 09:58:17,319 INFO org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /1.2.3.5:53748
2017-03-16 09:58:17,320 INFO org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /1.2.3.5:53748
2017-03-16 09:58:17,322 INFO org.apache.zookeeper.server.ZooKeeperServer - Established session 0x15ad68d898c0006 with negotiated timeout 40000 for client /1.2.3.5:53748
2017-03-16 09:58:18,336 INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.5:53748 which had sessionid 0x15ad68d898c0006
2017-03-16 10:10:23,881 WARN org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x15ad68d898c0001, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2017-03-16 10:10:23,885 INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.4:45752 which had sessionid 0x15ad68d898c0001
2017-03-16 10:10:23,885 WARN org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x15ad68d898c0002, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2017-03-16 10:10:23,887 INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.4:45754 which had sessionid 0x15ad68d898c0002

TaskManager log:
2017-03-16 09:58:14,308 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 1.2.3.4/1.2.3.4:2182, sessionid = 0x35ad68d8b4d0005, negotiated timeout = 40000
2017-03-16 09:58:14,309 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2017-03-16 09:58:14,321 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-03-16 09:58:14,337 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-3e54917c-076e-4f06-ac7a-2eac0067f724
2017-03-16 09:58:14,351 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1061025726.
2017-03-16 09:58:14,356 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: ResourceID{resourceId='c1d76fc91a0632f9863d187f70f32605'} @ ip-client1 (dataPort=39068)
2017-03-16 09:58:14,357 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
2017-03-16 09:58:14,359 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 73/1024/1024 MB, NON HEAP: 33/34/-1 MB (used/committed/max)]
2017-03-16 09:58:14,364 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-03-16 09:58:14,375 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.4:45164/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-16 09:58:14,592 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.4:45164/user/jobmanager), starting network stack and library cache.
2017-03-16 09:58:14,595 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.4:45689. Starting BLOB cache.
2017-03-16 09:58:14,600 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-f82879e0-47ad-4616-9000-9753ec787f49
2017-03-16 10:10:23,910 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-03-16 10:10:29,659 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:10:39,656 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:10:49,655 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:10:59,657 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:04,020 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:45164/user/jobmanager: Old JobManager lost its leadership.
2017-03-16 10:11:04,020 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-03-16 10:11:04,025 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-03-16 10:11:04,042 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.5:34987/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-16 10:11:04,174 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:34987/user/jobmanager), starting network stack and library cache.
2017-03-16 10:11:04,174 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42030. Starting BLOB cache.
2017-03-16 10:11:04,175 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-92bf7fe1-bab0-498c-90bf-6ec44ec6cb1e
2017-03-16 10:11:04,675 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:09,695 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:14,704 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:19,726 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:24,746 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:29,753 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:34,772 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:39,785 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:44,799 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:49,816 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:54,824 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:59,835 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:04,845 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:09,854 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:14,863 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:19,874 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:24,886 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:29,895 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:34,905 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:39,918 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:44,933 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:49,948 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:54,964 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:12:59,974 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:04,984 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:09,996 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:15,003 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:20,026 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:25,033 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:30,043 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:35,055 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:40,067 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:45,083 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:50,095 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:13:55,104 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:14:00,113 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:14:05,122 ERROR Remoting - Association to [akka.tcp://flink@1.2.3.4:45164] with UID [588297160] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
    at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:336)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:189)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

was (Author: razvan):
Hi Till, thanks for replying, sure I can attach the logs you mentioned

Started Zookeeper then cluster @~9:58, killed leader JobManager @~10:10
Whole cluster died @~10:14
Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:12:49,948 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:12:54,964 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:12:59,974 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:04,984 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:09,996 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:15,003 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. 
Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:20,026 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:25,033 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:30,043 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:35,055 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:40,067 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:45,083 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. 
Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:50,095 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:13:55,104 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:14:00,113 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164] 2017-03-16 10:14:05,122 ERROR Remoting - Association to [akka.tcp://flink@1.2.3.4:45164] with UID [588297160] irrecoverably failed. Quarantining address. java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped. 
    at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:336)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:189)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

> HA Configuration doesn't work with Flink 1.2
> --------------------------------------------
>
>                 Key: FLINK-6063
>                 URL: https://issues.apache.org/jira/browse/FLINK-6063
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.2.0
>            Reporter: Razvan
>            Priority: Critical
>
> I have a Flink 1.2 cluster made up of 3 JobManagers and 2 TaskManagers.
> I start the ZooKeeper quorum from JobManager1, get confirmation that
> ZooKeeper starts on the other 2 JobManagers, and then start a Flink job
> on JobManager1.
>
> The flink-conf.yaml is the same on all 5 VMs (as is everything else related
> to Flink, because I copied the folder across all VMs as suggested in
> tutorials). This means jobmanager.rpc.address points to JobManager1
> everywhere.
> If I turn off the VM running JobManager1, I would expect ZooKeeper to report
> that one of the remaining JobManagers is the leader, and the TaskManagers
> should reconnect to it.
> Instead a new leader is elected but the slaves keep connecting to the old master
> 2017-03-15 10:28:28,655 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
> 2017-03-15 10:28:38,534 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2017-03-15 10:28:46,606 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:28:52,431 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:29:02,435 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:29:10,489 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
> 2017-03-15 10:29:10,490 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
> 2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
> 2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
>     at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
>     at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
>     at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2017-03-15 10:29:10,512 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
> 2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
> 2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
>     at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
>     at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
>     at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
> 2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
> 2017-03-15 10:29:10,525 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
> 2017-03-15 10:29:10,542 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:29:10,546 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
> 2017-03-15 10:29:10,548 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
> 2017-03-15 10:29:10,551 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Flat Map (1/1)
> 2017-03-15 10:29:10,552 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
> 2017-03-15 10:29:10,567 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
> 2017-03-15 10:29:10,632 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
> 2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
> 2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
> 2017-03-15 10:29:15,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:29:20,571 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:29:25,582 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> 2017-03-15 10:29:30,592 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> I modified the original IPs to 1.2.3.4 for JobManager1 and 1.2.3.5 for JobManager2 for confidentiality.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
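
The timestamps in the 2017-03-16 TaskManager log suggest that the TaskManager kept retrying the killed JobManager (1.2.3.4:45164) for roughly three minutes before Akka quarantined the address, even though it had already registered at the new leader. For anyone reproducing this, the following is a minimal Python sketch of how that retry window could be measured from a log file. It is not part of the original report: the names `parse_entries` and `retry_window` and the regular expression are illustrative assumptions, and the sample holds only three lines copied from the log above.

```python
import re
from datetime import datetime

# Three entries copied from the TaskManager log above: the first and the last
# "Connection refused" warning, and the final quarantine error.
SAMPLE_LOG = """\
2017-03-16 10:11:04,675 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:14:00,113 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:14:05,122 ERROR Remoting - Association to [akka.tcp://flink@1.2.3.4:45164] with UID [588297160] irrecoverably failed. Quarantining address.
"""

# Assumed entry layout: "<date> <time,millis> <LEVEL> <rest of the message>".
ENTRY = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+(.*)$")


def parse_entries(log):
    """Yield (timestamp, level, message) for every line that looks like a log entry."""
    for line in log.splitlines():
        match = ENTRY.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield stamp, match.group(2), match.group(3)


def retry_window(log, address):
    """Return (first_failure, quarantined_at, failed_attempts) for one remote address.

    Counts the WARN entries reporting a failed association with `address` and
    stops at the first ERROR entry that quarantines it.
    """
    first = quarantined = None
    attempts = 0
    for stamp, level, message in parse_entries(log):
        if address not in message:
            continue
        if level == "WARN" and "has failed" in message:
            attempts += 1
            first = first or stamp
        elif level == "ERROR" and "Quarantining" in message:
            quarantined = stamp
            break
    return first, quarantined, attempts


if __name__ == "__main__":
    first, quarantined, attempts = retry_window(SAMPLE_LOG, "1.2.3.4:45164")
    print("first failure:", first)
    print("quarantined:  ", quarantined)
    print("failed attempts in sample:", attempts)
    print("retry window (s):", (quarantined - first).total_seconds())
```

Running the same function over the full TaskManager log file instead of `SAMPLE_LOG` would count every gated attempt rather than just the two sampled here.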