[ 
https://issues.apache.org/jira/browse/HADOOP-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang reassigned HADOOP-14322:
--------------------------------------

    Assignee: Yongjun Zhang

> Incorrect host info may be reported in failover message
> -------------------------------------------------------
>
>                 Key: HADOOP-14322
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14322
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>
> This may apply to other components, but using HDFS as an example.
> When multiple threads use the same DFSClient to make RPC calls, they may 
> report incorrect NN host name in the failover message:
> {code}
> INFO [pool-3-thread-13] retry.RetryInvocationHandler 
> (RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete 
> of class ClientNamenodeProtocolTranslatorPB over *a.b.c.d*:8020. Trying to 
> fail over immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state standby. Visit 
> https://s.apache.org/sbnn-error
> {code}
> where *a.b.c.d* is the RPC proxy corresponds to the active NN, which confuses 
> user to think failover is not behaving correctly. Because *a.b.c.d*  is 
> expected to be the proxy corresponding to the standby NN here instead.
> The reason is that when the ProxyDescriptor data field of 
> RetryInvocationHandler may be shared by multiple threads that do the RPC 
> calls, the failover done by one thread (which changed the rpc proxy) may be 
> visible to other threads when the other threads report the above message. 
> An example sequence: 
> # multiple threads start with the same SNN to do RPC calls, 
> # all threads discover that a failover is needed, 
> # thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
> # other threads reports the above message with the proxyInfo changed by 
> thread X, and reported ANN instead of SNN in the message.
> Some details:
> RetryInvocationHandler does the following when failing over:
> {code}
>   synchronized void failover(long expectedFailoverCount, Method method,
>                                int callId) {
>       // Make sure that concurrent failed invocations only cause a single
>       // actual failover.
>       if (failoverCount == expectedFailoverCount) {
>         fpp.performFailover(proxyInfo.proxy);
>         failoverCount++;
>       } else {
>         LOG.warn("A failover has occurred since the start of call #" + callId
>             + " " + proxyInfo.getString(method.getName()));
>       }
>       proxyInfo = fpp.getProxy();
>     }
> {code}
> and changed the proxyInfo in the ProxyDescriptor.
> While the log method below report message with ProxyDescriotor's proxyinfo:
> {code}
> private void log(final Method method, final boolean isFailover,
>       final int failovers, final long delay, final Exception ex) {
> ......
>    final StringBuilder b = new StringBuilder()
>         .append(ex + ", while invoking ")
>         .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
>     if (failovers > 0) {
>       b.append(" after ").append(failovers).append(" failover attempts");
>     }
>     b.append(isFailover? ". Trying to failover ": ". Retrying ");
>     b.append(delay > 0? "after sleeping for " + delay + "ms.": 
> "immediately.");
> {code}
> and so does  {{handleException}} method do
> {code}
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Exception while invoking call #" + callId + " "
>               + proxyDescriptor.getProxyInfo().getString(method.getName())
>               + ". Not retrying because " + retryInfo.action.reason, e);
>         }
> {code}
> FailoverProxyProvider
> {code}
>    public String getString(String methodName) {
>       return proxy.getClass().getSimpleName() + "." + methodName
>           + " over " + proxyInfo;
>     }
>     @Override
>     public String toString() {
>       return proxy.getClass().getSimpleName() + " over " + proxyInfo;
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to