[
https://issues.apache.org/jira/browse/STORM-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rui Li updated STORM-3713:
--------------------------
Description:
When Nimbus regains leadership, the leaderCallback syncs up with ZooKeeper:
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/nimbus/LeaderListenerCallback.java#L106]
[https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/cluster/StormClusterStateImpl.java#L212]
When a topology is killed, both ZooKeeper and the in-memory assignments map
are cleaned up.
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L313]
However, the syncRemoteAssignments call first reads the information from
ZooKeeper into stormIds. Then, after some processing (including
deserialization), it puts it into the local in-memory assignments backend. If
the ZooKeeper deletion happens between these two steps, the remote ZooKeeper
state and the local backend no longer match.
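The interleaving described above can be replayed deterministically in a small sketch. This is not Storm code: the class, the two maps standing in for ZooKeeper and the in-memory backend, and the method names are all hypothetical, and the snapshot copies ids and data in one step as a simplification.

```java
import java.util.HashMap;
import java.util.Map;

/** Deterministic replay of the sync/kill interleaving; every name here is hypothetical. */
class SyncKillRaceSketch {
    // Stand-ins for the remote (ZooKeeper) store and the local in-memory backend.
    static final Map<String, String> remote = new HashMap<>();
    static final Map<String, String> local = new HashMap<>();

    /** Sync step 1: snapshot what the remote store currently holds. */
    static Map<String, String> snapshotRemote() {
        return new HashMap<>(remote);
    }

    /** Sync step 2: write the (possibly stale) snapshot into the local backend. */
    static void applyToLocal(Map<String, String> snapshot) {
        local.putAll(snapshot);
    }

    /** Kill path: remove the topology from both stores. */
    static void killTopology(String id) {
        remote.remove(id);
        local.remove(id);
    }

    /** Returns true when the local backend ends up with an id the remote store no longer has. */
    static boolean replay() {
        remote.clear();
        local.clear();
        remote.put("topo-1", "assignment-bytes");

        Map<String, String> snapshot = snapshotRemote(); // sync step 1
        killTopology("topo-1");                          // kill lands between the two steps
        applyToLocal(snapshot);                          // sync step 2 re-adds topo-1 locally

        return local.containsKey("topo-1") && !remote.containsKey("topo-1");
    }

    public static void main(String[] args) {
        System.out.println("stale local entry after kill: " + replay());
    }
}
```

Running it prints `stale local entry after kill: true`: the kill removed the id from both stores, but the second sync step wrote the stale snapshot back into the local one.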
We found this issue after observing an NPE while making assignments:
{code:java}
2020-11-04 19:56:17.703 o.a.s.d.n.Nimbus timer [ERROR] Error while processing event
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$17(Nimbus.java:1419) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.StormTimer$1.run(StormTimer.java:110) ~[storm-client-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:226) [storm-client-2.3.0.y.jar:2.3.0.y]
Caused by: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.HeartbeatCache.getAliveExecutors(HeartbeatCache.java:199) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.aliveExecutors(Nimbus.java:2029) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeTopologyToAliveExecutors(Nimbus.java:2109) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeNewSchedulerAssignments(Nimbus.java:2272) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lockingMkAssignments(Nimbus.java:2467) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2453) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2397) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$17(Nimbus.java:1415) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    ... 2 more
2020-11-04 19:56:17.703 o.a.s.u.Utils timer [ERROR] Halting process: Error while processing event
{code}
[https://github.com/apache/storm/blob/fe2f7102e244336e288d26f2dde8089198ee4c33/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]
The existingAssignment comes from the in-memory backend, while
topologyToExecutors comes from ZooKeeper, which no longer includes the deleted
topology id.
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2111|https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]
[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/HeartbeatCache.java#L199]
As a result, the NPE is thrown in HeartbeatCache.getAliveExecutors.
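To illustrate the mismatch, the sketch below iterates over ids taken from a local view and looks each one up in a map that, like topologyToExecutors, lacks the deleted id. The class and method names are hypothetical, and countAlive is only a stand-in for a liveness check that dereferences its argument. The guarded variant shows one possible mitigation, not necessarily the fix adopted for this issue.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a lookup miss between two views of the topology set. */
class StaleAssignmentLookupSketch {
    /** Stand-in for a liveness check; throws NullPointerException when executors is null. */
    static int countAlive(List<String> executors) {
        return executors.size();
    }

    /** Trusts that every locally known id is also present in topologyToExecutors. */
    static Map<String, Integer> aliveUnguarded(Set<String> localIds,
                                               Map<String, List<String>> topologyToExecutors) {
        Map<String, Integer> alive = new HashMap<>();
        for (String id : localIds) {
            alive.put(id, countAlive(topologyToExecutors.get(id)));
        }
        return alive;
    }

    /** Skips ids that are missing from topologyToExecutors (assumed to be killed topologies). */
    static Map<String, Integer> aliveGuarded(Set<String> localIds,
                                             Map<String, List<String>> topologyToExecutors) {
        Map<String, Integer> alive = new HashMap<>();
        for (String id : localIds) {
            List<String> executors = topologyToExecutors.get(id);
            if (executors == null) {
                continue; // stale local entry: the remote view no longer knows this id
            }
            alive.put(id, countAlive(executors));
        }
        return alive;
    }

    public static void main(String[] args) {
        // topo-1 was killed remotely but is still present in the local view.
        Set<String> localIds = new LinkedHashSet<>(Arrays.asList("topo-1", "topo-2"));
        Map<String, List<String>> topologyToExecutors = new HashMap<>();
        topologyToExecutors.put("topo-2", Arrays.asList("executor-1"));

        try {
            aliveUnguarded(localIds, topologyToExecutors);
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup threw NullPointerException");
        }
        System.out.println("guarded result: " + aliveGuarded(localIds, topologyToExecutors));
    }
}
```

Skipping the stale id only hides the symptom; the local backend still holds an entry that the remote store has dropped, so closing the window between the two sync steps remains the underlying concern.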
> Possible race condition between zookeeper sync-up and killing topology
> ----------------------------------------------------------------------
>
> Key: STORM-3713
> URL: https://issues.apache.org/jira/browse/STORM-3713
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Rui Li
> Assignee: Rui Li
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
--
This message was sent by Atlassian Jira
(v8.3.4#803005)