[ 
https://issues.apache.org/jira/browse/AURORA-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496522#comment-15496522
 ] 

John Sirois commented on AURORA-1661:
-------------------------------------

I'd like to leave it as blocked on AURORA-1669 above and only close when that 
closes (the {{-zk_use_curator}} option is removed).

> Scheduler leader failed to re-announce itself after ZK name was changed
> -----------------------------------------------------------------------
>
>                 Key: AURORA-1661
>                 URL: https://issues.apache.org/jira/browse/AURORA-1661
>             Project: Aurora
>          Issue Type: Story
>            Reporter: Dmitriy Shirchenko
>            Assignee: John Sirois
>
> After we renamed our ZK cluster we were in a situation where the Aurora 
> leader did not write to the ZK endpoint. 
> Sequence that we performed:
> - change aurora cluster name in clusters.json
> - restart all schedulers (all good)
> - restart zk and move replication log (lose all state) during renaming: 
> intentional 
> - current Aurora leader fails to re-announce itself but election isn't 
> retriggered so we lose our leader
> - some time later: restart all schedulers, and a new leader is elected (good 
> again)
> Partial logs from the master (sorry, I'm not sure which parts are super 
> relevant).
> {code}
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:10,723:30181(0x7f7586244700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [10.162.18.25:2181] zk retcode=-4, errno=111(Connection refused): 
> server refused to accept the client
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:10,723:30181(0x7f7586a45700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [10.162.28.31:2181] zk retcode=-4, errno=111(Connection refused): 
> server refused to accept the client
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:10,723:30181(0x7f7586244700):ZOO_INFO@check_events@1703: initiated 
> connection to server [10.162.22.24:2181]
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:10,723:30181(0x7f7586a45700):ZOO_INFO@check_events@1703: initiated 
> connection to server [10.162.2.27:2181]
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:10,733:30181(0x7f7586a45700):ZOO_INFO@check_events@1750: session 
> establishment complete on server [10.162.2.27:2181], 
> sessionId=0x28004d50ccb80001, negotiated timeout=4000
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734082 
> 30370 group.cpp:349] Group process (group(1)@10.162.12.31:8083) connected to 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734127 
> 30370 group.cpp:831] Syncing group operations: queue size (joins, cancels, 
> datas) = (0, 0, 0)
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734136 
> 30370 group.cpp:427] Trying to create path '/aurora/replicated-log' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755173 
> 30373 network.hpp:413] ZooKeeper group memberships changed
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755249 
> 30367 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.756778 
> 30364 network.hpp:461] ZooKeeper group PIDs: { 
> log-replica(1)@10.162.14.30:8083 }
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763407 
> 30373 network.hpp:413] ZooKeeper group memberships changed
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763478 
> 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.764437 
> 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.765494 
> 30376 network.hpp:461] ZooKeeper group PIDs: { 
> log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.14.30:8083 }
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:10,782:30181(0x7f7586244700):ZOO_INFO@check_events@1750: session 
> establishment complete on server [10.162.22.24:2181], 
> sessionId=0x26004d8049520002, negotiated timeout=4000
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782429 
> 30376 group.cpp:349] Group process (group(2)@10.162.12.31:8083) connected to 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782444 
> 30376 group.cpp:831] Syncing group operations: queue size (joins, cancels, 
> datas) = (1, 0, 0)
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782447 
> 30376 group.cpp:427] Trying to create path '/aurora/replicated-log' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795754 
> 30377 network.hpp:413] ZooKeeper group memberships changed
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795805 
> 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.796530 
> 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.797600 
> 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000002' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.798283 
> 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000003' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799046 
> 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000004' in 
> ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799967 
> 30372 network.hpp:461] ZooKeeper group PIDs: { 
> log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.9.54:8083, 
> log-replica(1)@10.162.12.31:8083, log-replica(1)@10.162.14.30:8083, 
> log-replica(1)@10.162.29.25:8083 }
> Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 
> [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1032] Opening socket connection to server 
> compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to 
> authenticate using SA
> Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 
> [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:876] Socket connection established to 
> compute39-sjc1.prod.uber.internal/10.162.1.28:2181, initiating session
> Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.842 
> [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1158] Unable to read additional data from server 
> sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
> connecti
> Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 
> [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1032] Opening socket connection to server 
> compute38-sjc1.prod.uber.internal/10.162.22.24:2181. Will not attempt to 
> authenticate using S
> Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 
> [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:876] Socket connection established to 
> compute38-sjc1.prod.uber.internal/10.162.22.24:2181, initiating session
> Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.811 
> [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1158] Unable to read additional data from server 
> sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
> connecti
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:13,460:30181(0x7f7577fff700):ZOO_INFO@check_events@1703: initiated 
> connection to server [10.162.2.27:2181]
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
> 22:59:13,462:30181(0x7f7577fff700):ZOO_INFO@check_events@1750: session 
> establishment complete on server [10.162.2.27:2181], 
> sessionId=0x28004d50ccb8001c, negotiated timeout=10000
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463156 
> 30378 group.cpp:349] Group process (group(3)@10.162.12.31:8083) connected to 
> ZooKeeper
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463179 
> 30378 group.cpp:831] Syncing group operations: queue size (joins, cancels, 
> datas) = (0, 0, 0)
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463186 
> 30378 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466274 
> 30373 detector.cpp:154] Detected a new leader: (id='1')
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466698 
> 30365 group.cpp:700] Trying to get '/mesos/json.info_0000000001' in ZooKeeper
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467295 
> 30360 detector.cpp:479] A new leading master (UPID=master@10.162.29.25:5050) 
> is detected
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467350 
> 30370 sched.cpp:326] New master detected at master@10.162.29.25:5050
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467427 
> 30370 sched.cpp:382] Authenticating with master master@10.162.29.25:5050
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467437 
> 30370 sched.cpp:389] Using default CRAM-MD5 authenticatee
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467504 
> 30360 authenticatee.cpp:121] Creating new client SASL connection
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.753 
> [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1032] Opening socket connection to server 
> compute36-sjc1.prod.uber.internal/10.162.28.31:2181. Will not attempt to 
> authenticate using S
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:13.753 
> [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1162] Session 0x270049a1666d1483 for server null, 
> unexpected error, closing socket connection and attempting reconnect 
> java.net.Connect
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
> sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
> ~[na:1.8.0_72-internal]
> ...
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 
> ~[na:1.8.0_72-internal]
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>  ~[zookeeper-3.4.8.jar:3.4.8--1]
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) 
> ~[zookeeper-3.4.8.jar:3.4.8--1]
> Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 
> [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1032] Opening socket connection to server 
> compute40-sjc1.prod.uber.internal/10.162.2.27:2181. Will not attempt to 
> authenticate using SA
> Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 
> [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:876] Socket connection established to 
> compute40-sjc1.prod.uber.internal/10.162.2.27:2181, initiating session
> Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.852 
> [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1158] Unable to read additional data from server 
> sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
> connecti
> ...
> Apr 08 22:59:50 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:50.538 
> [RedirectMonitor STARTING-SendThread(compute37-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1158] Unable to read additional data from server 
> sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
> connecti
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 
> [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1032] Opening socket connection to server 
> compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to 
> authenticate using SA
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.273 
> [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 1635ms for sessionid 0x270049a1666d1483
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 
> [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
> ClientCnxn$SendThread:1156] Client session timed out, have not heard from 
> server in 1635ms for sessionid 0x270049a1666d1483, closing socket connection 
> and at
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.374 
> [AsyncProcessor-7, Group$ActiveMembership:370] Temporary error cancelling 
> membership:/aurora/scheduler/member_0000000360 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /aurora/scheduler/member_0000000360
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to