Aaron Gresch created STORM-3107: ----------------------------------- Summary: Nimbus confused about leadership after crash Key: STORM-3107 URL: https://issues.apache.org/jira/browse/STORM-3107 Project: Apache Storm Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aaron Gresch
Nimbus crashed and restarted without shutting down zookeeper due to a deadlock in the timer shutdown code. This could however also happen for various other issues. The problem is that once Nimbus restarted, it was really confused about who the leader was: {code:java} 2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for leader lock. 2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping assignments 2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping cleanup 2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping credential renewal. 2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology submission exception. (topology name='topology-testOverSubscribe-1') java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true} at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y] at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y] at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y] at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y] at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131] 2018-05-24 09:27:40.771 o.a.s.b.BlobStoreUtils Timer-1 [ERROR] Could not download the blob with key: topology-testOverCapacityScheduling-2-1519992333-stormcode.ser 2018-05-24 09:27:40.771 o.a.t.s.TThreadPoolServer pool-37-thread-63 [ERROR] Error occurred during processing of message. java.lang.RuntimeException: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true} at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2961) ~[storm-server-2.0.0.y.jar:2.0.0.y] at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y] at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y] at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131] Caused by: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true} at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y] at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y] ... 9 more {code} The session timeout was set to 20 seconds, but we're exceeding this period, and Nimbus did not recover leadership. It needed to be restarted manually to recover. -- This message was sent by Atlassian JIRA (v7.6.3#76005)