Dmitriy Shirchenko created AURORA-1661:
------------------------------------------

             Summary: Scheduler leader failed to re-announce itself after ZK 
name was changed
                 Key: AURORA-1661
                 URL: https://issues.apache.org/jira/browse/AURORA-1661
             Project: Aurora
          Issue Type: Story
            Reporter: Dmitriy Shirchenko
            Assignee: John Sirois


After we renamed our ZK cluster we were in a situation where the Aurora leader 
did not write to the ZK endpoint. 

Sequence that we performed:
- change aurora cluster name in clusters.json
- restart all schedulers (all good)
- restart zk and move replication log (lose all state) during renaming: 
intentional 
- current Aurora leader fails to re-announce itself but election isn't 
retriggered so we lose our leader
- some time later: restart all schedulers, and a new leader is elected (good 
again)


Partial logs from the master (sorry, I'm not sure which parts are super 
relevant).


{code}
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:10,723:30181(0x7f7586244700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [10.162.18.25:2181] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:10,723:30181(0x7f7586a45700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [10.162.28.31:2181] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:10,723:30181(0x7f7586244700):ZOO_INFO@check_events@1703: initiated 
connection to server [10.162.22.24:2181]
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:10,723:30181(0x7f7586a45700):ZOO_INFO@check_events@1703: initiated 
connection to server [10.162.2.27:2181]
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:10,733:30181(0x7f7586a45700):ZOO_INFO@check_events@1750: session 
establishment complete on server [10.162.2.27:2181], 
sessionId=0x28004d50ccb80001, negotiated timeout=4000
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734082 
30370 group.cpp:349] Group process (group(1)@10.162.12.31:8083) connected to 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734127 
30370 group.cpp:831] Syncing group operations: queue size (joins, cancels, 
datas) = (0, 0, 0)
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734136 
30370 group.cpp:427] Trying to create path '/aurora/replicated-log' in ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755173 
30373 network.hpp:413] ZooKeeper group memberships changed
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755249 
30367 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.756778 
30364 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.162.14.30:8083 
}
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763407 
30373 network.hpp:413] ZooKeeper group memberships changed
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763478 
30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.764437 
30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.765494 
30376 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.14.30:8083 }
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:10,782:30181(0x7f7586244700):ZOO_INFO@check_events@1750: session 
establishment complete on server [10.162.22.24:2181], 
sessionId=0x26004d8049520002, negotiated timeout=4000
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782429 
30376 group.cpp:349] Group process (group(2)@10.162.12.31:8083) connected to 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782444 
30376 group.cpp:831] Syncing group operations: queue size (joins, cancels, 
datas) = (1, 0, 0)
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782447 
30376 group.cpp:427] Trying to create path '/aurora/replicated-log' in ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795754 
30377 network.hpp:413] ZooKeeper group memberships changed
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795805 
30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.796530 
30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.797600 
30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000002' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.798283 
30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000003' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799046 
30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000004' in 
ZooKeeper
Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799967 
30372 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.9.54:8083, 
log-replica(1)@10.162.12.31:8083, log-replica(1)@10.162.14.30:8083, 
log-replica(1)@10.162.29.25:8083 }
Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 
[RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1032] Opening socket connection to server 
compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to 
authenticate using SA
Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 
[RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:876] Socket connection established to 
compute39-sjc1.prod.uber.internal/10.162.1.28:2181, initiating session
Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.842 
[RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1158] Unable to read additional data from server 
sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
connecti
Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 
[RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1032] Opening socket connection to server 
compute38-sjc1.prod.uber.internal/10.162.22.24:2181. Will not attempt to 
authenticate using S
Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 
[RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:876] Socket connection established to 
compute38-sjc1.prod.uber.internal/10.162.22.24:2181, initiating session
Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.811 
[RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1158] Unable to read additional data from server 
sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
connecti
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:13,460:30181(0x7f7577fff700):ZOO_INFO@check_events@1703: initiated 
connection to server [10.162.2.27:2181]
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 
22:59:13,462:30181(0x7f7577fff700):ZOO_INFO@check_events@1750: session 
establishment complete on server [10.162.2.27:2181], 
sessionId=0x28004d50ccb8001c, negotiated timeout=10000
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463156 
30378 group.cpp:349] Group process (group(3)@10.162.12.31:8083) connected to 
ZooKeeper
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463179 
30378 group.cpp:831] Syncing group operations: queue size (joins, cancels, 
datas) = (0, 0, 0)
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463186 
30378 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466274 
30373 detector.cpp:154] Detected a new leader: (id='1')
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466698 
30365 group.cpp:700] Trying to get '/mesos/json.info_0000000001' in ZooKeeper
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467295 
30360 detector.cpp:479] A new leading master ([email protected]:5050) is 
detected
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467350 
30370 sched.cpp:326] New master detected at [email protected]:5050
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467427 
30370 sched.cpp:382] Authenticating with master [email protected]:5050
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467437 
30370 sched.cpp:389] Using default CRAM-MD5 authenticatee
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467504 
30360 authenticatee.cpp:121] Creating new client SASL connection
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.753 
[RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1032] Opening socket connection to server 
compute36-sjc1.prod.uber.internal/10.162.28.31:2181. Will not attempt to 
authenticate using S
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:13.753 
[RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1162] Session 0x270049a1666d1483 for server null, 
unexpected error, closing socket connection and attempting reconnect 
java.net.Connect
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_72-internal]
...
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 
~[na:1.8.0_72-internal]
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 ~[zookeeper-3.4.8.jar:3.4.8--1]
Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) 
~[zookeeper-3.4.8.jar:3.4.8--1]
Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 
[RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1032] Opening socket connection to server 
compute40-sjc1.prod.uber.internal/10.162.2.27:2181. Will not attempt to 
authenticate using SA
Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 
[RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:876] Socket connection established to 
compute40-sjc1.prod.uber.internal/10.162.2.27:2181, initiating session
Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.852 
[RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1158] Unable to read additional data from server 
sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
connecti
...
Apr 08 22:59:50 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:50.538 
[RedirectMonitor STARTING-SendThread(compute37-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1158] Unable to read additional data from server 
sessionid 0x270049a1666d1483, likely server has closed socket, closing socket 
connecti
Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 
[RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1032] Opening socket connection to server 
compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to 
authenticate using SA
Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.273 
[RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
server in 1635ms for sessionid 0x270049a1666d1483
Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 
[RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), 
ClientCnxn$SendThread:1156] Client session timed out, have not heard from 
server in 1635ms for sessionid 0x270049a1666d1483, closing socket connection 
and at
Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.374 
[AsyncProcessor-7, Group$ActiveMembership:370] Temporary error cancelling 
membership:/aurora/scheduler/member_0000000360 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /aurora/scheduler/member_0000000360
...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to