[ https://issues.apache.org/jira/browse/AURORA-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496532#comment-15496532 ]
John Sirois commented on AURORA-1661: ------------------------------------- That's silly - you're right. > Scheduler leader failed to re-announce itself after ZK name was changed > ----------------------------------------------------------------------- > > Key: AURORA-1661 > URL: https://issues.apache.org/jira/browse/AURORA-1661 > Project: Aurora > Issue Type: Story > Reporter: Dmitriy Shirchenko > Assignee: John Sirois > > After we renamed our ZK cluster we were in a situation where the Aurora > leader did not write to the ZK endpoint. > Sequence that we performed: > - change aurora cluster name in clusters.json > - restart all schedulers (all good) > - restart zk and move replication log (lose all state) during renaming: > intentional > - current Aurora leader fails to re-announce itself but election isn't > retriggered so we lose our leader > - some time later: restart all schedulers, and a new leader is elected (good > again) > Partial logs from the master (sorry, I'm not sure which parts are super > relevant). > {code} > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:10,723:30181(0x7f7586244700):ZOO_ERROR@handle_socket_error_msg@1697: > Socket [10.162.18.25:2181] zk retcode=-4, errno=111(Connection refused): > server refused to accept the client > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:10,723:30181(0x7f7586a45700):ZOO_ERROR@handle_socket_error_msg@1697: > Socket [10.162.28.31:2181] zk retcode=-4, errno=111(Connection refused): > server refused to accept the client > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:10,723:30181(0x7f7586244700):ZOO_INFO@check_events@1703: initiated > connection to server [10.162.22.24:2181] > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:10,723:30181(0x7f7586a45700):ZOO_INFO@check_events@1703: initiated > connection to server [10.162.2.27:2181] > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:10,733:30181(0x7f7586a45700):ZOO_INFO@check_events@1750: session > establishment complete on server [10.162.2.27:2181], > sessionId=0x28004d50ccb80001, negotiated timeout=4000 > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734082 > 30370 group.cpp:349] Group process (group(1)@10.162.12.31:8083) connected to > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734127 > 30370 group.cpp:831] Syncing group operations: queue size (joins, cancels, > datas) = (0, 0, 0) > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734136 > 30370 group.cpp:427] Trying to create path '/aurora/replicated-log' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755173 > 30373 network.hpp:413] ZooKeeper group memberships changed > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755249 > 30367 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.756778 > 30364 network.hpp:461] ZooKeeper group PIDs: { > log-replica(1)@10.162.14.30:8083 } > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763407 > 30373 network.hpp:413] ZooKeeper group memberships changed > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763478 > 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.764437 > 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.765494 > 30376 network.hpp:461] ZooKeeper group PIDs: { > log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.14.30:8083 } > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:10,782:30181(0x7f7586244700):ZOO_INFO@check_events@1750: session > establishment complete on server [10.162.22.24:2181], > sessionId=0x26004d8049520002, negotiated timeout=4000 > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782429 > 30376 group.cpp:349] Group process (group(2)@10.162.12.31:8083) connected to > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782444 > 30376 group.cpp:831] Syncing group operations: queue size (joins, cancels, > datas) = (1, 0, 0) > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782447 > 30376 group.cpp:427] Trying to create path '/aurora/replicated-log' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795754 > 30377 network.hpp:413] ZooKeeper group memberships changed > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795805 > 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.796530 > 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.797600 > 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000002' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.798283 > 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000003' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799046 > 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000004' in > ZooKeeper > Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799967 > 30372 network.hpp:461] ZooKeeper group PIDs: { > log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.9.54:8083, > log-replica(1)@10.162.12.31:8083, log-replica(1)@10.162.14.30:8083, > log-replica(1)@10.162.29.25:8083 } > Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 > [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1032] Opening socket connection to server > compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to > authenticate using SA > Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 > [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:876] Socket connection established to > compute39-sjc1.prod.uber.internal/10.162.1.28:2181, initiating session > Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.842 > [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1158] Unable to read additional data from server > sessionid 0x270049a1666d1483, likely server has closed socket, closing socket > connecti > Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 > [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1032] Opening socket connection to server > compute38-sjc1.prod.uber.internal/10.162.22.24:2181. Will not attempt to > authenticate using S > Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 > [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:876] Socket connection established to > compute38-sjc1.prod.uber.internal/10.162.22.24:2181, initiating session > Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.811 > [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1158] Unable to read additional data from server > sessionid 0x270049a1666d1483, likely server has closed socket, closing socket > connecti > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:13,460:30181(0x7f7577fff700):ZOO_INFO@check_events@1703: initiated > connection to server [10.162.2.27:2181] > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 > 22:59:13,462:30181(0x7f7577fff700):ZOO_INFO@check_events@1750: session > establishment complete on server [10.162.2.27:2181], > sessionId=0x28004d50ccb8001c, negotiated timeout=10000 > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463156 > 30378 group.cpp:349] Group process (group(3)@10.162.12.31:8083) connected to > ZooKeeper > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463179 > 30378 group.cpp:831] Syncing group operations: queue size (joins, cancels, > datas) = (0, 0, 0) > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463186 > 30378 group.cpp:427] Trying to create path '/mesos' in ZooKeeper > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466274 > 30373 detector.cpp:154] Detected a new leader: (id='1') > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466698 > 30365 group.cpp:700] Trying to get '/mesos/json.info_0000000001' in ZooKeeper > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467295 > 30360 detector.cpp:479] A new leading master (UPID=master@10.162.29.25:5050) > is detected > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467350 > 30370 sched.cpp:326] New master detected at master@10.162.29.25:5050 > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467427 > 30370 sched.cpp:382] Authenticating with master master@10.162.29.25:5050 > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467437 > 30370 sched.cpp:389] Using default CRAM-MD5 authenticatee > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467504 > 30360 authenticatee.cpp:121] Creating new client SASL connection > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.753 > [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1032] Opening socket connection to server > compute36-sjc1.prod.uber.internal/10.162.28.31:2181. Will not attempt to > authenticate using S > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:13.753 > [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1162] Session 0x270049a1666d1483 for server null, > unexpected error, closing socket connection and attempting reconnect > java.net.Connect > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at > sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[na:1.8.0_72-internal] > ... > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[na:1.8.0_72-internal] > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[zookeeper-3.4.8.jar:3.4.8--1] > Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) > ~[zookeeper-3.4.8.jar:3.4.8--1] > Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 > [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1032] Opening socket connection to server > compute40-sjc1.prod.uber.internal/10.162.2.27:2181. Will not attempt to > authenticate using SA > Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 > [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:876] Socket connection established to > compute40-sjc1.prod.uber.internal/10.162.2.27:2181, initiating session > Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.852 > [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1158] Unable to read additional data from server > sessionid 0x270049a1666d1483, likely server has closed socket, closing socket > connecti > ... > Apr 08 22:59:50 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:50.538 > [RedirectMonitor STARTING-SendThread(compute37-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1158] Unable to read additional data from server > sessionid 0x270049a1666d1483, likely server has closed socket, closing socket > connecti > Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 > [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1032] Opening socket connection to server > compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to > authenticate using SA > Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.273 > [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1108] Client session timed out, have not heard from > server in 1635ms for sessionid 0x270049a1666d1483 > Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 > [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), > ClientCnxn$SendThread:1156] Client session timed out, have not heard from > server in 1635ms for sessionid 0x270049a1666d1483, closing socket connection > and at > Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.374 > [AsyncProcessor-7, Group$ActiveMembership:370] Temporary error cancelling > membership:/aurora/scheduler/member_0000000360 > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for /aurora/scheduler/member_0000000360 > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)