[ https://issues.apache.org/jira/browse/FLINK-6063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927786#comment-15927786 ]

Razvan commented on FLINK-6063:
-------------------------------

Hi Till, thanks for replying. Sure, I can attach the logs you mentioned:


Cluster configuration: Standalone cluster with JobManager at /1.2.3.4:44307
Using address 1.2.3.4:44307 to connect to JobManager.
JobManager web interface address http://1.2.3.4:8081
Starting execution of program
Submitting job with JobID: 2c64b1126f327261b0c43f33f3cf43ee. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@1.2.3.4:44307/user/jobmanager#2001981191]
03/16/2017 09:40:10     Job execution switched to status RUNNING.
03/16/2017 09:40:10     Source: Custom Source -> Flat Map(1/1) switched to SCHEDULED
03/16/2017 09:40:10     Source: Custom Source -> Flat Map(1/1) switched to DEPLOYING
03/16/2017 09:40:10     Flat Map(1/1) switched to SCHEDULED
03/16/2017 09:40:10     Flat Map(1/1) switched to DEPLOYING
03/16/2017 09:40:10     Flat Map(1/1) switched to RUNNING
03/16/2017 09:40:10     Source: Custom Source -> Flat Map(1/1) switched to RUNNING
New JobManager elected. Connecting to null
Connected to JobManager at Actor[akka.tcp://flink@1.2.3.5:43828/user/jobmanager#1400235434]


Killed JobManager

2017-03-16 09:58:14,953 INFO  org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /[Client 1 IP here]:40858
2017-03-16 09:58:14,953 INFO  org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /[Client 1 IP here]:40858
2017-03-16 09:58:14,957 INFO  org.apache.zookeeper.server.ZooKeeperServer - Established session 0x35ad68d8b4d0004 with negotiated timeout 40000 for client /[Client 1 IP here]:40858
2017-03-16 09:58:15,523 INFO  org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /[Client 2 IP here]:40276
2017-03-16 09:58:15,528 INFO  org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /[Client 2 IP here]:40276
2017-03-16 09:58:15,531 INFO  org.apache.zookeeper.server.ZooKeeperServer - Established session 0x35ad68d8b4d0005 with negotiated timeout 40000 for client /[Client 2 IP here]:40276
2017-03-16 10:10:25,118 WARN  org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x35ad68d8b4d0002, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2017-03-16 10:10:25,120 INFO  org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.4:47872 which had sessionid 0x35ad68d8b4d0002



New Leader

2017-03-16 09:58:17,319 INFO  org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /1.2.3.5:53748
2017-03-16 09:58:17,320 INFO  org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /1.2.3.5:53748
2017-03-16 09:58:17,322 INFO  org.apache.zookeeper.server.ZooKeeperServer - Established session 0x15ad68d898c0006 with negotiated timeout 40000 for client /1.2.3.5:53748
2017-03-16 09:58:18,336 INFO  org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.5:53748 which had sessionid 0x15ad68d898c0006
2017-03-16 10:10:23,881 WARN  org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x15ad68d898c0001, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2017-03-16 10:10:23,885 INFO  org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.4:45752 which had sessionid 0x15ad68d898c0001
2017-03-16 10:10:23,885 WARN  org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x15ad68d898c0002, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2017-03-16 10:10:23,887 INFO  org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /1.2.3.4:45754 which had sessionid 0x15ad68d898c0002


TaskManager

2017-03-16 09:58:14,308 INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server 1.2.3.4/1.2.3.4:2182, sessionid = 0x35ad68d8b4d0005, negotiated timeout = 40000
2017-03-16 09:58:14,309 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2017-03-16 09:58:14,321 INFO  org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-03-16 09:58:14,337 INFO  org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-3e54917c-076e-4f06-ac7a-2eac0067f724
2017-03-16 09:58:14,351 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1061025726.
2017-03-16 09:58:14,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: ResourceID{resourceId='c1d76fc91a0632f9863d187f70f32605'} @ ip-client1 (dataPort=39068)
2017-03-16 09:58:14,357 INFO  org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
2017-03-16 09:58:14,359 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 73/1024/1024 MB, NON HEAP: 33/34/-1 MB (used/committed/max)]
2017-03-16 09:58:14,364 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-03-16 09:58:14,375 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.4:45164/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-16 09:58:14,592 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.4:45164/user/jobmanager), starting network stack and library cache.
2017-03-16 09:58:14,595 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.4:45689. Starting BLOB cache.
2017-03-16 09:58:14,600 INFO  org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-f82879e0-47ad-4616-9000-9753ec787f49
2017-03-16 10:10:23,910 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-03-16 10:10:29,659 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:10:39,656 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:10:49,655 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:10:59,657 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:45164] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: /1.2.3.4:45164]
2017-03-16 10:11:04,020 INFO  org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:45164/user/jobmanager: Old JobManager lost its leadership.
2017-03-16 10:11:04,020 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-03-16 10:11:04,025 INFO  org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-03-16 10:11:04,042 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.5:34987/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-16 10:11:04,174 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:34987/user/jobmanager), starting network stack and library cache.
2017-03-16 10:11:04,174 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42030. Starting BLOB cache.
2017-03-16 10:11:04,175 INFO  org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-92bf7fe1-bab0-498c-90bf-6ec44ec6cb1e
2017-03-16 10:11:04,675 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:09,695 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:14,704 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:19,726 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:24,746 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:29,753 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:34,772 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:39,785 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:44,799 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:49,816 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:54,824 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:11:59,835 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:04,845 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:09,854 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:14,863 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:19,874 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:24,886 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:29,895 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:34,905 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:39,918 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:44,933 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:49,948 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:54,964 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:12:59,974 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:04,984 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:09,996 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:15,003 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:20,026 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:25,033 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:30,043 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:35,055 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:40,067 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:45,083 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:50,095 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:13:55,104 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:14:00,113 WARN  akka.remote.ReliableDeliverySupervisor            
            - Association with remote system [akka.tcp://flink@1.2.3.4:45164] 
has failed, address is now gated for [5000] ms. Reason: [Association failed 
with [akka.tcp://flink@1.2.3.4:45164]] Caused by: [Connection refused: 
/1.2.3.4:45164]
2017-03-16 10:14:05,122 ERROR Remoting - Association to [akka.tcp://flink@1.2.3.4:45164] with UID [588297160] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
        at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:336)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:189)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
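
The TaskManager log above can be used to measure how long the failover took. The following is a minimal Python sketch. It assumes each log entry has been joined onto a single line in the `date time,millis LEVEL logger - message` layout shown above. `parse` and `failover_gap` are hypothetical helper names, not Flink APIs. The sketch reports the seconds between the first `Reason: [Disassociated]` warning and the next `Successful registration at JobManager` entry.

```python
import re
from datetime import datetime

# One log entry per line: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - message".
# Assumes the email line-wrapping has been undone so each entry is a single line.
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(\S+)\s+-\s+(.*)$"
)


def parse(lines):
    """Yield (timestamp, level, logger, message) for each line matching LOG_LINE."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, match.group(2), match.group(3), match.group(4)


def failover_gap(lines):
    """Seconds from the first 'Disassociated' warning to the next successful
    JobManager registration, or None if either entry is missing."""
    lost_at = None
    for ts, _level, _logger, message in parse(lines):
        if lost_at is None and "Reason: [Disassociated]" in message:
            lost_at = ts  # first sign that the connection to the leader dropped
        elif lost_at is not None and message.startswith("Successful registration at JobManager"):
            # Registrations before the disconnect (e.g. at startup) are skipped above.
            return (ts - lost_at).total_seconds()
    return None
```

For the entries above (10:10:23,910 to 10:11:04,174), the function returns 40.264 seconds. That appears to line up with the negotiated ZooKeeper session timeout of 40000 ms in the ZooKeeper logs, but this correlation is an observation and is not stated by the logs themselves.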


> HA Configuration doesn't work with Flink 1.2
> --------------------------------------------
>
>                 Key: FLINK-6063
>                 URL: https://issues.apache.org/jira/browse/FLINK-6063
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.2.0
>            Reporter: Razvan
>            Priority: Critical
>
>  I have a Flink 1.2 cluster made up of 3 JobManagers and 2 TaskManagers. I
> start the ZooKeeper quorum from JobManager1, get confirmation that ZooKeeper
> has started on the other 2 JobManagers, and then start a Flink job on
> JobManager1.
>  
>  The flink-conf.yaml is identical on all 5 VMs (as is everything else related
> to Flink, because I copied the folder to every VM as the tutorials suggest),
> so jobmanager.rpc.address points to JobManager1 everywhere.
> If I turn off the VM running JobManager1, I would expect ZooKeeper to elect
> one of the remaining JobManagers as the leader and the TaskManagers to
> reconnect to it. Instead, a new leader is elected, but the slaves keep
> connecting to the old master:
>     2017-03-15 10:28:28,655 INFO  org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
>     2017-03-15 10:28:38,534 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>     2017-03-15 10:28:46,606 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:28:52,431 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:29:02,435 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:29:10,489 INFO  org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
>     2017-03-15 10:29:10,490 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
>     2017-03-15 10:29:10,491 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
>     2017-03-15 10:29:10,491 INFO  org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
>     java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
>       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
>       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
>       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
>       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>       at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>     2017-03-15 10:29:10,512 INFO  org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
>     2017-03-15 10:29:10,515 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
>     2017-03-15 10:29:10,515 INFO  org.apache.flink.runtime.taskmanager.Task - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
>     java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
>       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
>       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
>       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
>       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>       at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>     2017-03-15 10:29:10,516 INFO  org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
>     2017-03-15 10:29:10,516 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
>     2017-03-15 10:29:10,525 INFO  org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
>     2017-03-15 10:29:10,542 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:29:10,546 INFO  org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
>     2017-03-15 10:29:10,548 INFO  org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
>     2017-03-15 10:29:10,551 INFO  org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Flat Map (1/1)
>     2017-03-15 10:29:10,552 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>     2017-03-15 10:29:10,567 INFO  org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
>     2017-03-15 10:29:10,632 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
>     2017-03-15 10:29:10,633 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
>     2017-03-15 10:29:10,633 INFO  org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
>     2017-03-15 10:29:15,551 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:29:20,571 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:29:25,582 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
>     2017-03-15 10:29:30,592 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
> I replaced the original IPs with 1.2.3.4 for JobManager1 and 1.2.3.5 for
> JobManager2 for confidentiality.
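
The report's claim that the slaves keep connecting to the old master can be checked mechanically against a TaskManager log. The following is a minimal Python sketch. It assumes each log entry has been joined onto a single line, since the email wrapped them. The function name `retries_after_registration` and the regular expressions are illustrative and are not part of Flink. The function records the JobManager address from the most recent `Successful registration at JobManager` entry. It then collects every other address that `akka.remote.ReliableDeliverySupervisor` still reports as failed after that registration.

```python
import re

# Patterns for the two TaskManager log messages of interest (one entry per line).
REGISTERED = re.compile(
    r"Successful registration at JobManager \(akka\.tcp://flink@([\d.]+:\d+)/user/jobmanager\)"
)
GATED = re.compile(
    r"Association with remote system \[akka\.tcp://flink@([\d.]+:\d+)\] has failed"
)


def retries_after_registration(lines):
    """Return (address of the most recent JobManager registration,
    list of other addresses reported as failed after that registration)."""
    registered = None
    retried = []
    for line in lines:
        match = REGISTERED.search(line)
        if match:
            registered = match.group(1)
            retried = []  # only keep warnings logged after the latest registration
            continue
        match = GATED.search(line)
        if match and registered is not None and match.group(1) != registered:
            retried.append(match.group(1))
    return registered, retried
```

On the quoted excerpt, the registered address is 1.2.3.5:43893, and every warning logged after that registration names 1.2.3.4:44779.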



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
