[ 
https://issues.apache.org/jira/browse/HADOOP-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HADOOP-10630:
-------------------------------

       Resolution: Fixed
    Fix Version/s: 2.5.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

I've committed this to trunk and branch-2. Thanks for the review, [~kihwal] and 
[~sureshms]! And thanks [~arpitgupta] for reporting the issue.

> Possible race condition in RetryInvocationHandler
> -------------------------------------------------
>
>                 Key: HADOOP-10630
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10630
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>             Fix For: 2.5.0
>
>         Attachments: HADOOP-10630.000.patch
>
>
> In one of our system tests with a NameNode HA setup, we ran 300 threads in 
> LoadGenerator. Although one of the NameNodes was already in the active state 
> and had started serving requests, one of the client threads still failed all 
> of its retries within a 20-second window. Meanwhile, we saw many instances of 
> the following warning message in the log:
> {noformat}
> WARN retry.RetryInvocationHandler: A failover has occurred since the start of this method invocation attempt.
> {noformat}
> After checking the code, we see the following code in RetryInvocationHandler:
> {code}
>   while (true) {
>       // The number of times this invocation handler has ever been failed over,
>       // before this method invocation attempt. Used to prevent concurrent
>       // failed method invocations from triggering multiple failover attempts.
>       long invocationAttemptFailoverCount;
>       synchronized (proxyProvider) {
>         invocationAttemptFailoverCount = proxyProviderFailoverCount;
>       }
>       ......
>       if (action.action == RetryAction.RetryDecision.FAILOVER_AND_RETRY) {
>             // Make sure that concurrent failed method invocations only cause a
>             // single actual fail over.
>             synchronized (proxyProvider) {
>               if (invocationAttemptFailoverCount == proxyProviderFailoverCount) {
>                 proxyProvider.performFailover(currentProxy.proxy);
>                 proxyProviderFailoverCount++;
>                 currentProxy = proxyProvider.getProxy();
>               } else {
>                 LOG.warn("A failover has occurred since the start of this method"
>                     + " invocation attempt.");
>               }
>             }
>             invocationFailoverCount++;
>           }
>      ......
> {code}
> The value of currentProxy is refreshed only by the thread that actually 
> performs the failover (while holding the monitor of the proxyProvider). 
> Because "currentProxy" is not volatile, a thread that does not perform the 
> failover (in which case it logs the warning message) may fail to see the new 
> value of currentProxy.
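
One possible way to close this kind of gap, shown as a minimal sketch rather than the committed patch: declare the proxy reference volatile and refresh it inside the synchronized block whether or not the calling thread performed the failover. The class and method names below (ToyProxyProvider, ToyRetryHandler, failoverIfNeeded, snapshotFailoverCount) are hypothetical stand-ins and are not part of Hadoop's API; only the field names mirror the excerpt above.

```java
/** Hypothetical stand-in for a failover proxy provider; not Hadoop's API. */
final class ToyProxyProvider {
    private final String[] targets = {"nn1", "nn2"};
    private int index = 0;          // accessed only while holding this object's monitor
    private int failoversPerformed = 0;

    String getProxy() {
        return targets[index];
    }

    void performFailover(String failedProxy) {
        index = (index + 1) % targets.length;
        failoversPerformed++;
    }

    synchronized int failoversPerformed() {
        return failoversPerformed;
    }
}

/** Hypothetical handler that mirrors the failover-count check from the excerpt. */
final class ToyRetryHandler {
    private final ToyProxyProvider proxyProvider;
    private long proxyProviderFailoverCount = 0;   // guarded by proxyProvider
    // volatile so that reads outside the lock observe the latest proxy
    private volatile String currentProxy;

    ToyRetryHandler(ToyProxyProvider proxyProvider) {
        this.proxyProvider = proxyProvider;
        this.currentProxy = proxyProvider.getProxy();
    }

    /** Snapshot taken at the start of an invocation attempt. */
    long snapshotFailoverCount() {
        synchronized (proxyProvider) {
            return proxyProviderFailoverCount;
        }
    }

    String currentProxy() {
        return currentProxy;
    }

    /** Returns true only if this call performed the actual failover. */
    boolean failoverIfNeeded(long invocationAttemptFailoverCount) {
        synchronized (proxyProvider) {
            boolean performed = false;
            if (invocationAttemptFailoverCount == proxyProviderFailoverCount) {
                proxyProvider.performFailover(currentProxy);
                proxyProviderFailoverCount++;
                performed = true;
            }
            // Refresh unconditionally, so a thread that lost the race
            // also picks up the proxy installed by the winning thread.
            currentProxy = proxyProvider.getProxy();
            return performed;
        }
    }
}

final class FailoverSketchDemo {
    public static void main(String[] args) {
        ToyRetryHandler handler = new ToyRetryHandler(new ToyProxyProvider());
        long seenByA = handler.snapshotFailoverCount();
        long seenByB = handler.snapshotFailoverCount();
        handler.failoverIfNeeded(seenByA);   // performs the failover
        handler.failoverIfNeeded(seenByB);   // skips it, but still refreshes
        System.out.println(handler.currentProxy());   // prints "nn2"
    }
}
```

In this sketch two attempts that start with the same failover count cause a single failover, and both end up using the new proxy. Whether the volatile modifier, the unconditional refresh, or both are appropriate for the real handler depends on where currentProxy is read outside the lock.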



