[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sangjin Lee updated HADOOP-12482:
---------------------------------
    Fix Version/s: 2.6.5

Cherry-picked it to 2.6.5 (trivial).

> Race condition in JMX cache update
> ----------------------------------
>
>                 Key: HADOOP-12482
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12482
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Tony Wu
>            Assignee: Tony Wu
>             Fix For: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
>
>         Attachments: HADOOP-12482.001.patch, HADOOP-12482.002.patch,
> HADOOP-12482.003.patch, HADOOP-12482.004.patch, HADOOP-12482.005.patch,
> HADOOP-12482.006.patch
>
>
> updateJmxCache() was updated in HADOOP-11301. However, the patch introduced a
> race condition. In the updateJmxCache() function in MetricsSourceAdapter.java:
> {code:java}
> private void updateJmxCache() {
>   boolean getAllMetrics = false;
>   synchronized (this) {
>     if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
>       // temporarily advance the expiry while updating the cache
>       jmxCacheTS = Time.now() + jmxCacheTTL;
>       if (lastRecs == null) {
>         getAllMetrics = true;
>       }
>     } else {
>       return;
>     }
>     if (getAllMetrics) {
>       MetricsCollectorImpl builder = new MetricsCollectorImpl();
>       getMetrics(builder, true);
>     }
>     updateAttrCache();
>     if (getAllMetrics) {
>       updateInfoCache();
>     }
>     jmxCacheTS = Time.now();
>     lastRecs = null; // in case regular interval update is not running
>   }
> }
> {code}
> Notice that getAllMetrics is set to true only when both of the following hold:
> # jmxCacheTTL has passed
> # lastRecs == null
> lastRecs is set to null in the same function, but gets reassigned by
> getMetrics().
> However, getMetrics() can also be called from a different thread, via:
> # MetricsSystemImpl.onTimerEvent()
> # MetricsSystemImpl.publishMetricsNow()
> Consider the following sequence:
> # updateJmxCache() is called by getMBeanInfo() from a thread getting cached
> info.
> ** lastRecs is set to null.
> # A metrics source is updated with a new value/field.
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a
> different thread getting the latest metrics.
> ** lastRecs is updated (!= null).
> # jmxCacheTTL passes.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However, because lastRecs is already updated (!= null), getAllMetrics will
> not be set to true. So updateInfoCache() is not called and getMBeanInfo()
> returns the old cached info.
> We ran into this issue on a cluster where a new metric did not get published
> until much later.
> The situation is made worse by a periodic call to getMetrics() (driven by an
> external program or script). In that case getMBeanInfo() may never be able to
> retrieve the new record.
> The desired behavior is that updateJmxCache() is guaranteed to call
> updateInfoCache() once after jmxCacheTTL, if lastRecs has been set to null by
> updateJmxCache() itself.
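
One way to get the desired behavior described in the issue is to record, in a separate flag, that updateJmxCache() itself cleared lastRecs, and to trigger the full refresh from that flag rather than from lastRecs == null. The following is a minimal, self-contained sketch of that idea under stated assumptions; it is not the attached patch. JmxCacheSketch, lastRecsCleared, addMetric() and the explicit "now" parameter are hypothetical names introduced for illustration, and metric records are reduced to a list of names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for MetricsSourceAdapter; every name here is illustrative. */
class JmxCacheSketch {
    private final long ttlMillis;
    private long cacheTs = 0;
    private List<String> lastRecs = null;        // latest snapshot, may be set by any thread
    private boolean lastRecsCleared = true;      // hypothetical flag: "we reset lastRecs ourselves"
    private final List<String> source = new ArrayList<>();  // live metric names
    private List<String> infoCache = new ArrayList<>();     // what getMBeanInfo() returns

    JmxCacheSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    synchronized void addMetric(String name) {
        source.add(name);
    }

    /** Stand-in for getMetrics(); in the issue this may run on a timer thread. */
    synchronized List<String> getMetrics() {
        lastRecs = new ArrayList<>(source);
        return lastRecs;
    }

    /** Refreshes the info cache once per TTL, regardless of who last set lastRecs. */
    synchronized void updateJmxCache(long now) {
        if (now - cacheTs < ttlMillis) {
            return;  // cache is still fresh
        }
        // Decide from the flag, not from lastRecs == null, so a concurrent
        // getMetrics() call cannot suppress the refresh.
        boolean getAllMetrics = lastRecsCleared;
        lastRecsCleared = false;
        if (getAllMetrics) {
            getMetrics();
            infoCache = new ArrayList<>(lastRecs);
        }
        cacheTs = now;
        lastRecs = null;
        lastRecsCleared = true;
    }

    synchronized List<String> getMBeanInfo(long now) {
        updateJmxCache(now);
        return new ArrayList<>(infoCache);
    }

    public static void main(String[] args) {
        JmxCacheSketch cache = new JmxCacheSketch(10);
        cache.addMetric("a");
        cache.getMBeanInfo(10);   // first refresh; lastRecs is reset to null
        cache.addMetric("b");     // a new metric appears
        cache.getMetrics();       // "other thread" repopulates lastRecs
        System.out.println(cache.getMBeanInfo(20));  // prints [a, b]
    }
}
```

The main() method replays the sequence from the issue: with the original lastRecs == null check, the last call would return only the stale entry, whereas with the flag the new metric shows up as soon as the TTL has elapsed. The timestamp is passed in explicitly only to keep the sketch deterministic.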