Ethan Li created STORM-3622:
-------------------------------
Summary: Race Condition in CachedThreadStatesGaugeSet registered
at SystemBolt
Key: STORM-3622
URL: https://issues.apache.org/jira/browse/STORM-3622
Project: Apache Storm
Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Ethan Li
Assignee: Ethan Li
We noticed that with the change in https://github.com/apache/storm/pull/3242,
there is a race condition causing NPE.
{code:java}
2020-04-14 18:22:12.997 o.a.s.u.Utils Thread-17-__acker-executor[16, 16]
[ERROR] Async loop died!
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.storm.executor.Executor.accept(Executor.java:291)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:131)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.utils.JCQueue.consume(JCQueue.java:111)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:172)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.utils.Utils$1.run(Utils.java:434)
[storm-client-2.2.0.y.jar:2.2.0.y]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
Caused by: java.lang.NullPointerException
at
com.codahale.metrics.jvm.ThreadStatesGaugeSet.getThreadCount(ThreadStatesGaugeSet.java:95)
~[metrics-jvm-3.2.6.jar:3.2.6]
at
com.codahale.metrics.jvm.ThreadStatesGaugeSet.access$000(ThreadStatesGaugeSet.java:20)
~[metrics-jvm-3.2.6.jar:3.2.6]
at
com.codahale.metrics.jvm.ThreadStatesGaugeSet$1.getValue(ThreadStatesGaugeSet.java:56)
~[metrics-jvm-3.2.6.jar:3.2.6]
at org.apache.storm.executor.Executor.addV2Metrics(Executor.java:344)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.executor.Executor.metricsTick(Executor.java:320)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at
org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:218)
~[storm-client-2.2.0.y.jar:2.2.0.y]
at org.apache.storm.executor.Executor.accept(Executor.java:287)
~[storm-client-2.2.0.y.jar:2.2.0.y]
... 6 more
{code}
This is due to a race condition in CachedGauge
https://github.com/dropwizard/metrics/blob/v3.2.6/metrics-core/src/main/java/com/codahale/metrics/CachedGauge.java#L49-L53
There are two issues here.
The first one is
https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L335-L341.
This makes all the executors to get values for all the metrics. So multiple
threads will access the same metric.
So the threads gauges are now accessed by multiple threads. But in CachedGauge,
{code:java}
@Override
public T getValue() {
if (shouldLoad()) {
this.value = loadValue();
}
return value;
}
{code}
this method is not thread-safe. Two threads can reach to getValue at the same
time.
The first thread reaching shouldLoad knows it needs to reload, so it calls the
next line this.value=loadValue()
The second thread is a little bit late so shouldLoad returns false. Then it
returns the value directly.
There is a race condition between first thread calling loadValue() and the
second thread returning value.
If the first thread finishes loadValue() first, both values returned to the
threads are the same value (and current value). But if the second thread
returns earlier, the second thread gets the original value (which is null ),
hence NPE.
To summarize, the second issue is CachedThreadStatesGaugeSet is not thread-safe
To fix this NPE, we should avoid using CachedThreadStatesGaugeSet.
But we still need to fix
https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L335-L341
to avoid unnecessary computations and redundant metrics.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)