GrantPSpencer opened a new issue, #2973:
URL: https://github.com/apache/helix/issues/2973
### Describe the bug
An NPE can occur in `IntermediateStateCalcStage` when applying pending messages
to the `intermediateStateMap`. Specifically, when it applies a message whose
toState is DROPPED, it calls `.remove(..)` on a map that is null.
```
2024/10/29 01:48:13.046 ERROR [GenericHelixController] [HelixController-pipeline-default-CLUSTERNAME-(70ae9461_DEFAULT)] [helix] [] Exception while executing DEFAULT pipeline for cluster CLUSTERNAME. Will not continue to next pipeline
java.lang.NullPointerException: null
    at org.apache.helix.controller.stages.IntermediateStateCalcStage.lambda$computeIntermediateMap$2(IntermediateStateCalcStage.java:868) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at java.util.HashMap.forEach(HashMap.java:1337) ~[?:?]
    at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediateMap(IntermediateStateCalcStage.java:864) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediatePartitionState(IntermediateStateCalcStage.java:402) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at org.apache.helix.controller.stages.IntermediateStateCalcStage.compute(IntermediateStateCalcStage.java:180) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at org.apache.helix.controller.stages.IntermediateStateCalcStage.process(IntermediateStateCalcStage.java:85) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
    at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
```
```
for (Map.Entry<Partition, Map<String, Message>> entry : pendingMessageMap.entrySet()) {
  entry.getValue().forEach((key, value) -> {
    if (!value.getToState().equals(HelixDefinedState.DROPPED.name())) {
      intermediateStateMap.setState(entry.getKey(), value.getTgtName(), value.getToState());
    } else {
      // NPE here when get(entry.getKey()) returns null, i.e. the partition has no state map
      intermediateStateMap.getStateMap().get(entry.getKey()).remove(value.getTgtName());
    }
  });
}
```
### To Reproduce
Unable to reproduce outside of unit tests. Currently I think the behavior
occurs when:
1. Resource has a partition with 1 replica.
2. Message is sent to instance A to drop replica, but replica does not exist
in instance's current state anymore.
3. Controller snapshots cluster and runs pipeline.
4. `IntermediateStateCalcStage` attempts to call `.remove()` on a map that
does not exist, because the partition has no entry in the intermediate state map.
I think the above state can be reached when:
1. Race condition where node reads the message, drops the current state, but
hasn't deleted the message yet so it is still seen as a pending message
2. Node goes offline so there is no current state
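
The failure mode described above can be sketched without Helix at all. This is a
minimal stand-in, not the controller code: plain strings replace `Partition` and
`Message`, a nested `HashMap` replaces the intermediate state map, and the class
and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class DropOnMissingPartitionSketch {

    // Stand-in for the intermediate state: partition name -> (instance name -> state).
    // Assumption: plain strings are used instead of Helix's Partition/Message types.
    static boolean dropThrowsNpe(Map<String, Map<String, String>> stateMap,
                                 String partition, String instance) {
        try {
            // Same shape as the failing call: get(...) may return null, then remove(...) throws.
            stateMap.get(partition).remove(instance);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> stateMap = new HashMap<>();

        // No entry for the partition: the pending DROPPED message has nothing to remove from.
        System.out.println(dropThrowsNpe(stateMap, "resource_0", "instanceA"));

        // With an entry present, the same call removes the replica without error.
        Map<String, String> replicas = new HashMap<>();
        replicas.put("instanceA", "ONLINE");
        stateMap.put("resource_0", replicas);
        System.out.println(dropThrowsNpe(stateMap, "resource_0", "instanceA"));
    }
}
```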
### Expected behavior
In my opinion, failing to remove because the map is null should not error out.
We can add a null check, or use `getOrDefault` to return an empty map.
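
A rough sketch of the null-check variant, again using plain maps rather than the
real Helix classes. The class and method names are hypothetical, and
`computeIfAbsent` only stands in for whatever `setState` does internally; this
illustrates the idea and is not a proposed patch.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeDropSketch {

    // Assumption: mirrors HelixDefinedState.DROPPED.name() from the snippet in this issue.
    static final String DROPPED = "DROPPED";

    // Applies one pending message to a partition -> (instance -> state) map.
    static void applyPendingMessage(Map<String, Map<String, String>> stateMap,
                                    String partition, String instance, String toState) {
        if (!DROPPED.equals(toState)) {
            // Stand-in for setState(...): create the per-partition map on demand.
            stateMap.computeIfAbsent(partition, p -> new HashMap<>()).put(instance, toState);
            return;
        }
        // Null check: if the partition has no state map, there is nothing to drop.
        Map<String, String> instanceStates = stateMap.get(partition);
        if (instanceStates != null) {
            instanceStates.remove(instance);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> stateMap = new HashMap<>();
        // Dropping from a partition with no state map is now a no-op instead of an NPE.
        applyPendingMessage(stateMap, "resource_0", "instanceA", DROPPED);
        System.out.println(stateMap);
    }
}
```

With this shape, a stale DROPPED message for a replica that is already gone leaves
the map unchanged and the pipeline can continue.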
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]