[I] WagedInstanceCapacity Null Pointer Exception due to stale _instanceCapacityMap [helix]

via GitHub Fri, 23 Aug 2024 20:54:48 -0700


GrantPSpencer opened a new issue, #2891:
URL: https://github.com/apache/helix/issues/2891


   ### Describe the bug
   The WAGED pipeline fails with a NullPointerException (NPE) during `BestPossibleStateCalcStage` because it calls `checkAndReduceInstanceCapacity` on an instance that is not in `WagedInstanceCapacity`'s `_instanceCapacityMap`. This happens when `WagedInstanceCapacity` is calculated at point A, a new instance is added at point B, and then at point C `CurrentStateComputationStage` does not refresh `WagedInstanceCapacity` to include that instance. The specific circumstances are detailed below.
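   To make the failure mode concrete, here is a minimal, self-contained model of a capacity snapshot built at point A. `CapacitySnapshot` and its methods are hypothetical and are not Helix's `WagedInstanceCapacity`. The sketch only assumes that the capacity map is keyed by instance name and is read without a null check. That is consistent with the NPE in the stack trace under "Additional context", but the trace does not confirm it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a capacity snapshot; NOT Helix's WagedInstanceCapacity.
class CapacitySnapshot {
  // instance name -> (capacity key -> remaining capacity), built once at "point A"
  private final Map<String, Map<String, Integer>> instanceCapacityMap = new HashMap<>();

  CapacitySnapshot(Set<String> instances, Map<String, Integer> capacityPerInstance) {
    for (String instance : instances) {
      instanceCapacityMap.put(instance, new HashMap<>(capacityPerInstance));
    }
  }

  // ASSUMPTION: unchecked lookup. An instance added after the snapshot was built
  // yields null here, and the first dereference below throws a NullPointerException.
  boolean checkAndReduce(String instance, Map<String, Integer> usage) {
    Map<String, Integer> remaining = instanceCapacityMap.get(instance);
    for (Map.Entry<String, Integer> entry : usage.entrySet()) {
      if (remaining.get(entry.getKey()) < entry.getValue()) {
        return false; // not enough capacity left on this instance
      }
    }
    // Enough capacity for every key: subtract the usage from the remaining capacity.
    usage.forEach((key, used) -> remaining.merge(key, -used, Integer::sum));
    return true;
  }
}
```

   With a snapshot built from `{"instance_0"}`, calls for `instance_0` behave normally, while a call for `"new_instance"` throws a `NullPointerException`, mirroring the reproduction steps that follow.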
   
   
   ### To Reproduce
   
   1. Add at least one WAGED-enabled resource to a cluster and rebalance so that assignments are made.
   2. Drop all resources from the cluster.
   3. Add a new instance (`"new_instance"`) to the cluster.
   4. Add one WAGED-enabled resource to the cluster.
   5. An NPE is thrown during `BestPossibleStateCalcStage`.
   
   This occurs because `"new_instance"` is an assignable instance and appears in the newly calculated preference list, so `checkAndReduceInstanceCapacity` is called on it. However, `WagedInstanceCapacity`'s `_instanceCapacityMap` has not been updated and still holds a stale view that does not include `"new_instance"`.
   
   
   The cause is the `skipCapacityCalculation` method (a very effective optimization), which makes `CurrentStateComputationStage` skip refreshing the cache when there are no resources in the resourceMap. However, the resourceMap is built from the cluster's IdealStates, which do not exist yet at this point. When a resource is added, a `ResourceConfigChange` event fires first, followed by an `IdealStateChange` event. So when a new resource is added, `CurrentStateComputationStage` does not recalculate `WagedInstanceCapacity` on the `ResourceConfigChange` (the resourceMap is still empty), and it does not recalculate it on the subsequent `IdealStateChange` either.
   
   Adding a WAGED resource to a new cluster does not trigger this NPE because no `WagedInstanceCapacity` exists yet, so the following check
   ```
       if (Objects.isNull(cache.getWagedInstanceCapacity())) {
         return false;
       }
   ```
   forces it to be refreshed.
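   The skip decision described in the two paragraphs above can be modeled roughly as follows. This is a hypothetical sketch, not the real `skipCapacityCalculation`. `ChangeType`, `SkipDecisionModel`, and especially the assumption that an `IdealStateChange` is not treated as capacity-relevant are illustrative choices, made only to reproduce the event sequence reported in this issue.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical event types for this sketch; the real controller has many more.
enum ChangeType { RESOURCE_CONFIG, IDEAL_STATE, INSTANCE_CONFIG }

class SkipDecisionModel {
  // ASSUMPTION: only these change types trigger a capacity recalculation in this model.
  static final Set<ChangeType> CAPACITY_RELEVANT =
      EnumSet.of(ChangeType.RESOURCE_CONFIG, ChangeType.INSTANCE_CONFIG);

  // Returns true when the capacity recalculation would be skipped.
  static boolean skipCapacityCalculation(boolean capacityExists,
      Map<String, ?> resourceMap, ChangeType event) {
    if (!capacityExists) {
      return false; // mirrors the null check quoted above: always compute the first time
    }
    if (resourceMap.isEmpty()) {
      return true; // no resources (no IdealStates yet), so the old snapshot is kept
    }
    return !CAPACITY_RELEVANT.contains(event);
  }
}
```

   Under these assumptions, the `ResourceConfigChange` is skipped because the resourceMap is still empty, and the following `IdealStateChange` is skipped because it is not capacity-relevant. The stale snapshot therefore survives until `BestPossibleStateCalcStage`.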
   
   https://github.com/GrantPSpencer/helix/pull/32
   The test case in this draft PR follows the steps outlined above and fails on master.
   
   ### Expected behavior
   `WagedInstanceCapacity` should be recalculated before `BestPossibleStateCalcStage` runs whenever a new resource is added.
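   One possible direction, sketched under assumptions: in addition to the existing skip conditions, never skip the recalculation when the snapshot is missing any currently assignable instance. `StalenessCheck` and its methods are hypothetical and are not taken from the linked draft PR. The sketch assumes the controller can cheaply obtain the set of assignable instance names and the key set of the capacity map.

```java
import java.util.Set;

// Hypothetical guard; NOT part of Helix.
class StalenessCheck {
  // The snapshot is stale if any assignable instance is missing from it.
  static boolean isStale(Set<String> snapshotInstances, Set<String> assignableInstances) {
    return !snapshotInstances.containsAll(assignableInstances);
  }

  // ASSUMPTION: wraps an existing skip decision; a stale snapshot is never kept.
  static boolean shouldSkip(boolean baseSkipDecision, Set<String> snapshotInstances,
      Set<String> assignableInstances) {
    return baseSkipDecision && !isStale(snapshotInstances, assignableInstances);
  }
}
```

   Comparing key sets costs one pass over the assignable instances, so the guard keeps most of the benefit of the skip optimization while covering the case where an instance joined after the last calculation.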
   
   ### Additional context
   ```
   10539 [HelixController-pipeline-default-TestWagedNPE_cluster-(45df0f8d_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController [] - Exception while executing DEFAULT pipeline for cluster TestWagedNPE_cluster. Will not continue to next pipeline
   java.lang.NullPointerException: null
        at org.apache.helix.controller.rebalancer.waged.WagedInstanceCapacity.checkAndReduceInstanceCapacity(WagedInstanceCapacity.java:206) ~[classes/:?]
        at org.apache.helix.controller.dataproviders.ResourceControllerDataProvider.checkAndReduceCapacity(ResourceControllerDataProvider.java:535) ~[classes/:?]
        at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossibleStateForPartition(DelayedAutoRebalancer.java:377) ~[classes/:?]
        at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:271) ~[classes/:?]
        at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:54) ~[classes/:?]
        at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.lambda$computeNewIdealStates$0(WagedRebalancer.java:281) ~[classes/:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
        at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1692) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) ~[?:?]
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746) ~[?:?]
        at java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:290) ~[?:?]
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java) ~[?:?]
        at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:408) ~[?:?]
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736) ~[?:?]
        at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159) ~[?:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[?:?]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661) ~[?:?]
        at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:277) ~[classes/:?]
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:445) ~[classes/:?]
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:289) ~[classes/:?]
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:94) ~[classes/:?]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[classes/:?]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

