Yes, retrying would be the most logical way to work around this. The code is kind of odd in that there are two separate locations in ZK for this information. Leader election simply stores who the leader is at `/leader-lock`, but the information about all of the nimbus instances that are alive is stored under `/nimbuses`. What you have run into is a case where the two are not in sync with each other: the leader lock said nimbus-A was the leader, and `/nimbuses` had no knowledge of nimbus-A at all.

If nimbus-A was crashing during this period of time, then it is a race and we need to fix it with a retry (I'll file a JIRA for this anyway, as we should have this in no matter what). If nimbus-A was not crashing, then either ZK somehow messed up or we somehow messed up. The only way that could happen on our end is if for some reason we have two different connections to ZK, one for leader election and another for writing to `/nimbuses`.

If that is not the case, and this is reproducible, then yes, the first thing to do is to turn on debug logging and try to grab the snapshot/edit logs for your ZK cluster right after this happens. I am really hopeful that it was nimbus crashing.
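
To make the retry idea a bit more concrete, here is a minimal sketch in plain Java. It is not the actual Storm code and it does not talk to ZK: `LeaderLookupRetry`, `retry`, the in-memory map standing in for `/nimbuses`, and the attempt count and sleep interval are all made up for illustration. The only point is that a lookup which comes up empty handed gets a few more tries before the caller has to deal with a missing summary, instead of hitting a NullPointerException.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, not part of Storm: retries a lookup that may be
// transiently empty (for example, the leader's summary not yet visible).
final class LeaderLookupRetry {

    /**
     * Calls lookup up to maxAttempts times, sleeping sleepMs between
     * attempts, and returns the first non-empty result. Returns empty if
     * every attempt comes up empty or the thread is interrupted.
     */
    static <T> Optional<T> retry(Supplier<Optional<T>> lookup, int maxAttempts, long sleepMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> result = lookup.get();
            if (result.isPresent()) {
                return result;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the entries under /nimbuses (assumption: id -> summary string).
        Map<String, String> nimbuses = new HashMap<>();
        String leaderId = "nimbus-A"; // what the leader lock claims
        int[] reads = {0};

        Optional<String> summary = retry(() -> {
            // Simulate the leader's entry only becoming visible on the third read.
            if (++reads[0] == 3) {
                nimbuses.put(leaderId, "host-a:6627");
            }
            return Optional.ofNullable(nimbuses.get(leaderId));
        }, 5, 50L);

        System.out.println(summary
                .map(s -> "leader summary: " + s)
                .orElse("leader summary still missing after retries"));
    }
}
```

Returning an empty `Optional` when the retries run out forces the caller to handle the "leader known but summary missing" case explicitly, which is where a warning in the logs would go.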

- Bobby

On Sunday, May 14, 2017, 4:03:22 PM CDT, S G <[email protected]> wrote:

Thanks Bobby,

This looks like a serious issue to me.
Any ideas how I can provide more information (like enable some logs etc) to gain more insight into this problem?

It might be a good idea to add some retry logic or some waiting logic on the node that comes up empty handed so that it handles the error more gracefully rather than crashing with a NullPointerException?

Also, the leader election is supposed to happen through zookeeper, right?
Isn't the new leader becoming a leader after saving its state in zookeeper?
Because then the other nodes should not come empty handed.
If no, then it seems like a bug and the leader should persist the state in zookeeper first before becoming a leader.

> looks like it is caused by trying to read a NimbusSummary for the leader but not being able to find it

Instead of crashing, this should trigger a new leader election IMO with some good warning messages in the logs.

Disclaimer: I have not seen the actual code that does the nimbus leader election.
Above are just some suggestions based on my limited knowledge.
So please forgive any outrageous/obvious ideas :)

On Tue, May 9, 2017 at 1:58 PM, Bobby Evans <[email protected]> wrote:

> This looks like something odd is happening with leader election. The
> exception looks like it is caused by trying to read a NimbusSummary for the
> leader but not being able to find it. So it could mean that a leader is
> elected and is then crashing quickly enough that the other node when it
> tries to read this loses the race and comes up empty handed. But if you
> only have a single nimbus configured then this is not the case and
> something else worse is happening.
>
>
> - Bobby
>
> On Monday, May 8, 2017, 4:41:13 PM CDT, S G <[email protected]>
> wrote:
> Hi,
>
> I am trying to upgrade from 1.0.2 to 1.1.0 version of Storm.
> And I see the below exception happening randomly on the Nimbus node.
> When it happens, Nimbus is unable to accept any new topologies.
>
>
> java.lang.NullPointerException: null
>     at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:301) ~[clojure-1.7.0.jar:?]
>     at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getLeader(nimbus.clj:2383) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.generated.Nimbus$Processor$getLeader.getResult(Nimbus.java:3944) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.generated.Nimbus$Processor$getLeader.getResult(Nimbus.java:3928) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
>     at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_51]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_51]
>     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_51]
>
>
> I have not been able to isolate what causes this exception.
> Any help would be appreciated.
>
> Thanks
> SG
