[
https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tony Wu updated HADOOP-12482:
-----------------------------
Attachment: HADOOP-12482.002.patch
In v2 patch:
* Rebased to latest trunk.
Manually re-ran the previously reported failing test cases (on Linux, with the
native profile); they all pass without error:
{code}
$ mvn -Dtest=TestMetricsSourceAdapter,TestDecayRpcScheduler,TestCopyPreserveFlag,TestReloadingX509TrustManager,TestGangliaMetrics test -Pnative
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running org.apache.hadoop.fs.shell.TestCopyPreserveFlag
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.77 sec - in org.apache.hadoop.fs.shell.TestCopyPreserveFlag
Running org.apache.hadoop.metrics2.impl.TestGangliaMetrics
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.504 sec - in org.apache.hadoop.metrics2.impl.TestGangliaMetrics
Running org.apache.hadoop.metrics2.impl.TestMetricsSourceAdapter
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.447 sec - in org.apache.hadoop.metrics2.impl.TestMetricsSourceAdapter
Running org.apache.hadoop.ipc.TestDecayRpcScheduler
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.98 sec - in org.apache.hadoop.ipc.TestDecayRpcScheduler
Running org.apache.hadoop.security.ssl.TestReloadingX509TrustManager
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.538 sec - in org.apache.hadoop.security.ssl.TestReloadingX509TrustManager
Results :
Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
{code}
> Race condition in JMX cache update
> ----------------------------------
>
> Key: HADOOP-12482
> URL: https://issues.apache.org/jira/browse/HADOOP-12482
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: Tony Wu
> Assignee: Tony Wu
> Attachments: HADOOP-12482.001.patch, HADOOP-12482.002.patch
>
>
> updateJmxCache() was changed in HADOOP-11301, but that patch introduced a
> race condition. The relevant code in updateJmxCache() in MetricsSourceAdapter.java:
> {code:java}
> private void updateJmxCache() {
>   boolean getAllMetrics = false;
>   synchronized (this) {
>     if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
>       // temporarilly advance the expiry while updating the cache
>       jmxCacheTS = Time.now() + jmxCacheTTL;
>       if (lastRecs == null) {
>         getAllMetrics = true;
>       }
>     } else {
>       return;
>     }
>     if (getAllMetrics) {
>       MetricsCollectorImpl builder = new MetricsCollectorImpl();
>       getMetrics(builder, true);
>     }
>     updateAttrCache();
>     if (getAllMetrics) {
>       updateInfoCache();
>     }
>     jmxCacheTS = Time.now();
>     lastRecs = null; // in case regular interval update is not running
>   }
> }
> {code}
> Notice that getAllMetrics is set to true only when both of the following hold:
> # jmxCacheTTL has passed
> # lastRecs == null
> lastRecs is set to null at the end of the same function, and is reassigned by
> getMetrics().
> However, getMetrics() can also be called from a different thread, via:
> # MetricsSystemImpl.onTimerEvent()
> # MetricsSystemImpl.publishMetricsNow()
> Consider the following sequence:
> # updateJmxCache() is called by getMBeanInfo() from a thread getting cached
> info.
> ** lastRecs is set to null.
> # A metrics source is updated with a new value/field.
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a
> different thread getting the latest metrics.
> ** lastRecs is updated (!= null).
> # jmxCacheTTL passes.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However, because lastRecs has already been updated (!= null), getAllMetrics
> is not set to true, so updateInfoCache() is not called and getMBeanInfo()
> returns the old cached info.
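
The sequence above can be replayed deterministically in a small single-threaded model. This is only a sketch: the class, the fake clock `now`, and the list-based stand-ins for the metrics source and the info cache are hypothetical and are not Hadoop code. Only the control flow of updateJmxCache() follows the snippet quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the quoted caching logic (hypothetical, not Hadoop code). */
class JmxCacheRaceDemo {
    long now = 0;                     // fake clock standing in for Time.now()
    final long jmxCacheTTL = 10;
    long jmxCacheTS = 0;
    Object lastRecs = null;
    final List<String> sourceMetrics = new ArrayList<>(); // what the source exposes
    List<String> infoCache = new ArrayList<>();           // what getMBeanInfo() would see

    // Stand-in for getMetrics(): snapshots the source and reassigns lastRecs.
    void getMetrics() {
        lastRecs = new ArrayList<>(sourceMetrics);
    }

    // Same control flow as the quoted updateJmxCache().
    void updateJmxCache() {
        boolean getAllMetrics = false;
        if (now - jmxCacheTS >= jmxCacheTTL) {
            jmxCacheTS = now + jmxCacheTTL;
            if (lastRecs == null) {
                getAllMetrics = true;
            }
        } else {
            return;
        }
        if (getAllMetrics) {
            getMetrics();
            infoCache = new ArrayList<>(sourceMetrics); // stand-in for updateInfoCache()
        }
        jmxCacheTS = now;
        lastRecs = null;
    }

    // Replays the five steps; returns whether the new metric reached the info cache.
    static boolean newMetricVisibleAfterReplay() {
        JmxCacheRaceDemo a = new JmxCacheRaceDemo();
        a.sourceMetrics.add("oldMetric");
        a.now = 10;
        a.updateJmxCache();                // 1. first refresh; lastRecs ends up null
        a.sourceMetrics.add("newMetric");  // 2. source gains a new field
        a.getMetrics();                    // 3. "other thread" sets lastRecs != null
        a.now = 20;                        // 4. jmxCacheTTL passes
        a.updateJmxCache();                // 5. lastRecs != null, so no info refresh
        return a.infoCache.contains("newMetric");
    }

    public static void main(String[] args) {
        System.out.println(newMetricVisibleAfterReplay()); // prints "false"
    }
}
```

In this model the second updateJmxCache() call resets lastRecs to null, so a later expiry would pick up the new metric, unless another getMetrics() call lands in between again.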
> We ran into this issue on a cluster where a new metric did not get published
> until much later.
> The problem is made worse by periodic calls to getMetrics() (driven by an
> external program or script). In that case getMBeanInfo() may never
> retrieve the new record.
> The desired behavior is that updateJmxCache() is guaranteed to call
> updateInfoCache() once after jmxCacheTTL has passed, if lastRecs was set to
> null by updateJmxCache() itself.
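
One way to get the desired behavior described above is to record that updateJmxCache() itself cleared lastRecs, instead of inferring it from lastRecs == null. The sketch below uses the same simplified model as the replay (fake clock, list-based caches). The flag name `lastRecsCleared` is a hypothetical choice for illustration, and this is not presented as the contents of the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model with an explicit "cleared by us" flag (hypothetical sketch). */
class JmxCacheFixSketch {
    long now = 0;                     // fake clock standing in for Time.now()
    final long jmxCacheTTL = 10;
    long jmxCacheTS = 0;
    Object lastRecs = null;
    boolean lastRecsCleared = true;   // assumed flag; true so the first call refreshes
    final List<String> sourceMetrics = new ArrayList<>();
    List<String> infoCache = new ArrayList<>();

    // Stand-in for getMetrics(); may be called by other threads at any time.
    synchronized void getMetrics() {
        lastRecs = new ArrayList<>(sourceMetrics);
    }

    synchronized void updateJmxCache() {
        if (now - jmxCacheTS < jmxCacheTTL) {
            return;
        }
        jmxCacheTS = now + jmxCacheTTL;
        // Decide from the flag, not from lastRecs, which another thread
        // may already have reassigned.
        boolean getAllMetrics = lastRecsCleared;
        lastRecsCleared = false;
        if (getAllMetrics) {
            getMetrics();                               // monitor is reentrant
            infoCache = new ArrayList<>(sourceMetrics); // stand-in for updateInfoCache()
        }
        jmxCacheTS = now;
        lastRecs = null;
        lastRecsCleared = true;       // remember that this method cleared lastRecs
    }

    // Same five-step replay as in the issue description.
    static boolean newMetricVisibleAfterReplay() {
        JmxCacheFixSketch a = new JmxCacheFixSketch();
        a.sourceMetrics.add("oldMetric");
        a.now = 10;
        a.updateJmxCache();
        a.sourceMetrics.add("newMetric");
        a.getMetrics();               // no longer suppresses the next refresh
        a.now = 20;
        a.updateJmxCache();
        return a.infoCache.contains("newMetric");
    }

    public static void main(String[] args) {
        System.out.println(newMetricVisibleAfterReplay()); // prints "true"
    }
}
```

In this simplified model the flag is set again on every exit from updateJmxCache(), so each expiry triggers a full refresh. That costs one getMetrics() call per jmxCacheTTL, but it no longer depends on what other threads did to lastRecs in the meantime.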