[ 
https://issues.apache.org/jira/browse/STORM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Li updated STORM-3622:
----------------------------
    Component/s: storm-client

> Race Condition in CachedThreadStatesGaugeSet registered at SystemBolt
> ---------------------------------------------------------------------
>
>                 Key: STORM-3622
>                 URL: https://issues.apache.org/jira/browse/STORM-3622
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-client
>    Affects Versions: 2.2.0
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.2.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We noticed that with the change in https://github.com/apache/storm/pull/3242, 
> there is a race condition causing NPE.
> {code:java}
> 2020-04-14 18:22:12.997 o.a.s.u.Utils Thread-17-__acker-executor[16, 16] 
> [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.NullPointerException
>  at org.apache.storm.executor.Executor.accept(Executor.java:291) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:131) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.utils.JCQueue.consume(JCQueue.java:111) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:172) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.utils.Utils$1.run(Utils.java:434) 
> [storm-client-2.2.0.y.jar:2.2.0.y]
>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
> Caused by: java.lang.NullPointerException
>  at 
> com.codahale.metrics.jvm.ThreadStatesGaugeSet.getThreadCount(ThreadStatesGaugeSet.java:95)
>  ~[metrics-jvm-3.2.6.jar:3.2.6]
>  at 
> com.codahale.metrics.jvm.ThreadStatesGaugeSet.access$000(ThreadStatesGaugeSet.java:20)
>  ~[metrics-jvm-3.2.6.jar:3.2.6]
>  at 
> com.codahale.metrics.jvm.ThreadStatesGaugeSet$1.getValue(ThreadStatesGaugeSet.java:56)
>  ~[metrics-jvm-3.2.6.jar:3.2.6]
>  at org.apache.storm.executor.Executor.addV2Metrics(Executor.java:344) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.Executor.metricsTick(Executor.java:320) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at 
> org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:218)
>  ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  at org.apache.storm.executor.Executor.accept(Executor.java:287) 
> ~[storm-client-2.2.0.y.jar:2.2.0.y]
>  ... 6 more
> {code}
> This is due to a race condition in CachedGauge 
> https://github.com/dropwizard/metrics/blob/v3.2.6/metrics-core/src/main/java/com/codahale/metrics/CachedGauge.java#L49-L53
> There are two issues here.
> The first one is STORM-3623. 
> This makes all the executors to get values for all the metrics. So multiple 
> threads will access the same metric.
> So the threads gauges are now accessed by multiple threads. But in 
> CachedGauge,
> {code:java}
>  @Override
>     public T getValue() {
>         if (shouldLoad()) {
>             this.value = loadValue();
>         }
>         return value;
>     }
> {code}
> this method is not thread-safe. Two threads can reach to getValue at the same 
> time.
> The first thread reaching shouldLoad knows it needs to reload, so it calls 
> the next line this.value=loadValue()
> The second thread is a little bit late so shouldLoad returns false. Then it 
> returns the value directly.
> There is a race condition between first thread calling loadValue() and the 
> second thread returning value.
> If the first thread finishes loadValue() first, both values returned to the 
> threads are the same value (and current value). But if the second thread 
> returns earlier, the second thread gets the original value (which is null ), 
> hence NPE.
> To summarize, the second issue is CachedThreadStatesGaugeSet is not 
> thread-safe
> To fix this NPE, we should avoid using CachedThreadStatesGaugeSet. 
> But we still need to fix STORM-3623  to avoid unnecessary computations and 
> redundant metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to