[ https://issues.apache.org/jira/browse/HADOOP-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jing Zhao updated HADOOP-10630:
-------------------------------
Status: Patch Available (was: Open)
> Possible race condition in RetryInvocationHandler
> -------------------------------------------------
>
> Key: HADOOP-10630
> URL: https://issues.apache.org/jira/browse/HADOOP-10630
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Jing Zhao
> Assignee: Jing Zhao
> Attachments: HADOOP-10630.000.patch
>
>
> In one of our system tests with a NameNode HA setup, we ran 300 threads in
> LoadGenerator. Although one of the NameNodes was already in the active state
> and had started serving requests, one of the client threads still failed all
> of its retries within a 20-second window. Meanwhile, we saw many instances of
> the following warning message in the log:
> {noformat}
> WARN retry.RetryInvocationHandler: A failover has occurred since the start of this method invocation attempt.
> {noformat}
> Checking the code, we found the following in RetryInvocationHandler:
> {code}
> while (true) {
>   // The number of times this invocation handler has ever been failed over,
>   // before this method invocation attempt. Used to prevent concurrent
>   // failed method invocations from triggering multiple failover attempts.
>   long invocationAttemptFailoverCount;
>   synchronized (proxyProvider) {
>     invocationAttemptFailoverCount = proxyProviderFailoverCount;
>   }
>   ......
>   if (action.action == RetryAction.RetryDecision.FAILOVER_AND_RETRY) {
>     // Make sure that concurrent failed method invocations only cause a
>     // single actual fail over.
>     synchronized (proxyProvider) {
>       if (invocationAttemptFailoverCount == proxyProviderFailoverCount) {
>         proxyProvider.performFailover(currentProxy.proxy);
>         proxyProviderFailoverCount++;
>         currentProxy = proxyProvider.getProxy();
>       } else {
>         LOG.warn("A failover has occurred since the start of this method"
>             + " invocation attempt.");
>       }
>     }
>     invocationFailoverCount++;
>   }
>   ......
> {code}
> As the code shows, currentProxy is refreshed only by the thread that actually
> performs the failover (while holding the monitor of proxyProvider). Because
> currentProxy is not volatile, a thread that does not perform the failover
> (and therefore logs the warning message above) may fail to see the new value
> of currentProxy, and so may keep retrying against the stale proxy.
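
For illustration only, here is a minimal, self-contained sketch of one way the visibility problem described above could be addressed: declaring the shared proxy reference volatile while keeping the failover-count check under the proxyProvider monitor. ToyProxyProvider, ToyRetryHandler, failoverIfNeeded and the "nn1"/"nn2" targets are hypothetical stand-ins invented for this sketch. They are not the Hadoop classes, and this is not necessarily what HADOOP-10630.000.patch does.

```java
/** Hypothetical stand-in for a failover proxy provider; NOT the Hadoop interface. */
class ToyProxyProvider {
    private final String[] targets = {"nn1", "nn2"}; // made-up target names
    private int index = 0;

    String getProxy() {
        return targets[index];
    }

    void performFailover(String current) {
        // Switch to the next target, wrapping around.
        index = (index + 1) % targets.length;
    }
}

/** Hypothetical, stripped-down analogue of the failover logic quoted above. */
class ToyRetryHandler {
    private final ToyProxyProvider proxyProvider;

    // Guarded by the proxyProvider monitor, as in the quoted code.
    private long proxyProviderFailoverCount = 0;

    // Assumption of this sketch: volatile, so that a write made by the thread
    // that performs the failover is visible to threads that read the field
    // without holding the monitor.
    private volatile String currentProxy;

    ToyRetryHandler(ToyProxyProvider provider) {
        this.proxyProvider = provider;
        this.currentProxy = provider.getProxy();
    }

    /** Snapshot of the failover count, taken at the start of an attempt. */
    long failoverCount() {
        synchronized (proxyProvider) {
            return proxyProviderFailoverCount;
        }
    }

    String currentProxy() {
        return currentProxy;
    }

    /**
     * Performs a failover only if none has happened since the caller took its
     * snapshot. Returns true if this call actually performed the failover.
     */
    boolean failoverIfNeeded(long invocationAttemptFailoverCount) {
        synchronized (proxyProvider) {
            if (invocationAttemptFailoverCount == proxyProviderFailoverCount) {
                proxyProvider.performFailover(currentProxy);
                proxyProviderFailoverCount++;
                currentProxy = proxyProvider.getProxy();
                return true;
            }
            // Another caller already failed over; the volatile read in
            // currentProxy() will observe the proxy that caller installed.
            return false;
        }
    }
}
```

In this sketch, two callers that both took their snapshot before the failover behave as in the description: the first performs the failover, the second skips it, and both subsequently read the same refreshed proxy through currentProxy().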