Rui Li created STORM-3713:
-----------------------------

             Summary: Possible race condition between zookeeper sync-up and 
killing topology
                 Key: STORM-3713
                 URL: https://issues.apache.org/jira/browse/STORM-3713
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Rui Li
            Assignee: Rui Li


When Nimbus regains leadership, the leaderCallback syncs up with 
ZooKeeper: 
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/nimbus/LeaderListenerCallback.java#L106]
 
[https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/cluster/StormClusterStateImpl.java#L212]
   When a topology is killed, both ZooKeeper and the in-memory assignments map 
are cleaned up. 
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L313]
   However, the syncRemoteAssignments call first reads the topology ids from 
ZooKeeper into stormIds. Then, after some processing (including 
deserialization), it puts the assignments into the local in-memory assignments 
backend. If the ZooKeeper deletion happens between these two steps, the remote 
ZooKeeper state and the local backend no longer match.

We found this issue because we observed an NPE when making assignments:

2020-11-04 19:56:17.703 o.a.s.d.n.Nimbus timer [ERROR] Error while processing event
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$17(Nimbus.java:1419) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.StormTimer$1.run(StormTimer.java:110) ~[storm-client-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:226) [storm-client-2.3.0.y.jar:2.3.0.y]
Caused by: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.HeartbeatCache.getAliveExecutors(HeartbeatCache.java:199) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.aliveExecutors(Nimbus.java:2029) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeTopologyToAliveExecutors(Nimbus.java:2109) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeNewSchedulerAssignments(Nimbus.java:2272) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lockingMkAssignments(Nimbus.java:2467) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2453) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2397) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$17(Nimbus.java:1415) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    ... 2 more
2020-11-04 19:56:17.703 o.a.s.u.Utils timer [ERROR] Halting process: Error while processing event
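
For illustration, the window between reading from ZooKeeper and writing to the local backend can be replayed deterministically in a single thread. This is only a minimal sketch, not Storm code: the class SyncRaceSketch, the two maps, and the methods readRemote, populateLocal, and killTopology are hypothetical stand-ins for ZooKeeper, the in-memory assignments backend, the two steps of the sync, and the kill path.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SyncRaceSketch {
    // Hypothetical stand-ins: "remote" plays ZooKeeper, "local" plays the
    // in-memory assignments backend. Neither is a real Storm type.
    static final Map<String, String> remote = new ConcurrentHashMap<>();
    static final Map<String, String> local = new ConcurrentHashMap<>();

    // Sync step 1: snapshot the topology ids and assignments from the remote store.
    static Map<String, String> readRemote() {
        return new HashMap<>(remote);
    }

    // Sync step 2: write the (possibly outdated) snapshot into the local backend.
    static void populateLocal(Map<String, String> snapshot) {
        local.putAll(snapshot);
    }

    // Kill path: clean up both the remote store and the local backend.
    static void killTopology(String topoId) {
        remote.remove(topoId);
        local.remove(topoId);
    }

    public static void main(String[] args) {
        remote.put("topo-1", "assignment-1");

        Map<String, String> snapshot = readRemote(); // sync step 1
        killTopology("topo-1");                      // kill lands inside the window
        populateLocal(snapshot);                     // sync step 2 re-adds the entry locally

        System.out.println("remote has topo-1: " + remote.containsKey("topo-1"));
        System.out.println("local has topo-1: " + local.containsKey("topo-1"));
    }
}
```

With this interleaving the remote store no longer contains topo-1 while the local backend still does, which is the mismatch described above.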
[https://github.com/apache/storm/blob/fe2f7102e244336e288d26f2dde8089198ee4c33/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]
   The existingAssignment comes from the in-memory backend, while the 
topologyToExecutors comes from ZooKeeper, which does not include the deleted 
topology id. 
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]
 
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2111|https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]
 
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/HeartbeatCache.java#L199]
 So the NPE occurs.
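
The failure mode can be sketched as follows, assuming a heavily simplified model of the lookup: a topology id that is only present in the in-memory assignments is looked up in a map built from ZooKeeper, the lookup yields null, and iterating over that null set throws. The class StaleAssignmentSketch and the methods countExecutors and computeAlive are hypothetical and do not mirror the real Nimbus or HeartbeatCache signatures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StaleAssignmentSketch {
    // Hypothetical stand-in for the alive-executor computation:
    // it iterates over every executor of one topology.
    static int countExecutors(Set<List<Integer>> allExecutors) {
        int count = 0;
        for (List<Integer> executor : allExecutors) { // NPE when allExecutors is null
            count++;
        }
        return count;
    }

    // Ids come from the in-memory assignments; executors come from the
    // ZooKeeper-derived map, which may lack a topology that was killed.
    static Map<String, Integer> computeAlive(
            Set<String> existingAssignmentIds,
            Map<String, Set<List<Integer>>> topologyToExecutors) {
        Map<String, Integer> result = new HashMap<>();
        for (String topoId : existingAssignmentIds) {
            Set<List<Integer>> allExecutors = topologyToExecutors.get(topoId); // null for a stale id
            result.put(topoId, countExecutors(allExecutors));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> existing = Collections.singleton("topo-1"); // stale in-memory entry
        Map<String, Set<List<Integer>>> topologyToExecutors = new HashMap<>(); // topo-1 already deleted
        try {
            computeAlive(existing, topologyToExecutors);
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NPE for stale topology id");
        }
    }
}
```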



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
