[
https://issues.apache.org/jira/browse/HADOOP-11361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366789#comment-15366789
]
Yongjun Zhang commented on HADOOP-11361:
----------------------------------------
Hi [~vinayrpet] and [~ozawa],
Thanks for the discussion.
First of all, I tend to agree with Tsuyoshi on the AND/OR analysis, however, I
saw that all calls to {{getMetrics}} method passed true to {{all}} parameter,
so this change won't impact the current final result. But we certainly can
address this issue, either as a separate jira, or a side effect of the fix here.
Thanks [~vinayrpet] for the explanation of the race condition. Sorry I did
miss the "return" statement when making the earlier comment.
It looks to me that the acquirement of {{lastRecs}} and the {{updateAttrCache}}
should be protected in a same {{synchronized(this)}} block, to avoid this race
condition. And indeed the two threads Vinay described intend to have their own
builder, their own lastRecs. Say, we can break down the {{getMetrics}} method
into two parts,
{code}
void getMetricsPart1(MetricsCollectorImpl builder,
boolean all) {
builder.setRecordFilter(recordFilter).setMetricFilter(metricFilter);
synchronized(this) {
if (lastRecs == null || jmxCacheTS == 0) {
all = true; // Get all the metrics to populate the sink caches
}
}
try {
source.getMetrics(builder, all);
} catch (Exception e) {
LOG.error("Error getting metrics from source "+ name, e);
}
for (MetricsRecordBuilderImpl rb : builder) {
for (MetricsTag t : injectedTags) {
rb.add(t);
}
}
}
Iterable<MetricsRecordImpl> getMetricsPart2(MetricsCollectorImpl builder) {
lastRecs = builder.getRecords();
return lastRecs;
}
{code}
and change
{code}
if (getAllMetrics) {
MetricsCollectorImpl builder = new MetricsCollectorImpl();
getMetrics(builder, true);
}
synchronized(this) {
updateAttrCache();
if (getAllMetrics) {
updateInfoCache();
}
jmxCacheTS = Time.now();
lastRecs = null; // in case regular interval update is not running
lastRecsCleared = true;
}
{code}
to
{code}
MetricsCollectorImpl builder = NULL;
if (getAllMetrics) {
builder = new MetricsCollectorImpl();
getMetricsPart1(builder, true);
}
synchronized(this) {
if (getAllMetrics) {
getMetricsPart2(builder);
}
updateAttrCache();
if (getAllMetrics) {
updateInfoCache();
}
jmxCacheTS = Time.now();
lastRecs = null; // in case regular interval update is not running
lastRecsCleared = true;
}
{code}
This is just to throw an idea to see what's the best solution. The method names
can be adjusted.
What do you guys think?
Thanks.
> Fix a race condition in MetricsSourceAdapter.updateJmxCache
> -----------------------------------------------------------
>
> Key: HADOOP-11361
> URL: https://issues.apache.org/jira/browse/HADOOP-11361
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.4.1, 2.5.1, 2.6.0
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Attachments: HADOOP-111361-003.patch, HADOOP-11361-002.patch,
> HADOOP-11361-004.patch, HADOOP-11361.patch, HDFS-7487.patch
>
>
> {noformat}
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.updateAttrCache(MetricsSourceAdapter.java:247)
> at
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.updateJmxCache(MetricsSourceAdapter.java:177)
> at
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getAttribute(MetricsSourceAdapter.java:102)
> at
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:647)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]