Aaron Gresch created STORM-3107:
-----------------------------------

             Summary: Nimbus confused about leadership after crash
                 Key: STORM-3107
                 URL: https://issues.apache.org/jira/browse/STORM-3107
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 2.0.0
            Reporter: Aaron Gresch


Nimbus crashed and restarted without shutting down ZooKeeper, due to a deadlock 
in the timer shutdown code.  Other failures could lead to the same situation.

 

After the restart, Nimbus was confused about who the leader was:

 
{code:java}
2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for leader lock.
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping assignments
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping cleanup
2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping credential renewal.

2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology submission exception. (topology name='topology-testOverSubscribe-1')
java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
        at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
        at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
        at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
        at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
2018-05-24 09:27:40.771 o.a.s.b.BlobStoreUtils Timer-1 [ERROR] Could not download the blob with key: topology-testOverCapacityScheduling-2-1519992333-stormcode.ser
2018-05-24 09:27:40.771 o.a.t.s.TThreadPoolServer pool-37-thread-63 [ERROR] Error occurred during processing of message.
java.lang.RuntimeException: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
        at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2961) ~[storm-server-2.0.0.y.jar:2.0.0.y]
        at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
        at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
        at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
        at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
        ... 9 more
{code}
The session timeout was set to 20 seconds, but even after that period had 
passed Nimbus did not regain leadership.  It had to be restarted manually to 
recover.
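
For reference, the recovery one would expect here can be sketched with a toy model. It assumes the leader lock behaves like a queue of ZooKeeper-style ephemeral entries, each dropped once its session has been silent for longer than the session timeout. The class and method names below are hypothetical; this is not Storm, Curator, or ZooKeeper code, only an illustration of the expected timing.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of session-bound leader locks (hypothetical, not Storm code). */
public class LeaderLockSketch {
    /** One queued lock entry; it expires if its session stops heartbeating. */
    static final class Entry {
        final String owner;
        long lastHeartbeatMs;

        Entry(String owner, long nowMs) {
            this.owner = owner;
            this.lastHeartbeatMs = nowMs;
        }
    }

    private final long sessionTimeoutMs;
    private final List<Entry> queue = new ArrayList<>();

    LeaderLockSketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Queue up for the lock; the oldest live entry is the leader. */
    Entry enqueue(String owner, long nowMs) {
        Entry entry = new Entry(owner, nowMs);
        queue.add(entry);
        return entry;
    }

    /** Record that the owning session is still alive at nowMs. */
    void heartbeat(Entry entry, long nowMs) {
        entry.lastHeartbeatMs = nowMs;
    }

    /** Drop entries whose session timed out, then return the head owner (or null). */
    String leaderAt(long nowMs) {
        queue.removeIf(e -> nowMs - e.lastHeartbeatMs > sessionTimeoutMs);
        return queue.isEmpty() ? null : queue.get(0).owner;
    }

    public static void main(String[] args) {
        LeaderLockSketch locks = new LeaderLockSketch(20_000); // 20 s, as in this report

        // The old process crashes at t=0 without releasing its entry.
        locks.enqueue("nimbus-before-crash", 0);
        // The restarted process queues up behind the stale entry.
        Entry restarted = locks.enqueue("nimbus-after-restart", 1_000);

        locks.heartbeat(restarted, 15_000);
        System.out.println(locks.leaderAt(15_000)); // stale entry still holds the lock

        locks.heartbeat(restarted, 21_000);
        System.out.println(locks.leaderAt(21_000)); // stale entry expired, restart leads
    }
}
```

In this model the restarted process takes over shortly after the 20-second timeout, which is the recovery that did not occur in the cluster described above.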

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
