[
https://issues.apache.org/jira/browse/HADOOP-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongjun Zhang reassigned HADOOP-14322:
--------------------------------------
Assignee: Yongjun Zhang
> Incorrect host info may be reported in failover message
> -------------------------------------------------------
>
> Key: HADOOP-14322
> URL: https://issues.apache.org/jira/browse/HADOOP-14322
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
>
> This may apply to other components, but using HDFS as an example.
> When multiple threads use the same DFSClient to make RPC calls, they may
> report incorrect NN host name in the failover message:
> {code}
> INFO [pool-3-thread-13] retry.RetryInvocationHandler
> (RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete
> of class ClientNamenodeProtocolTranslatorPB over *a.b.c.d*:8020. Trying to
> fail over immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby. Visit
> https://s.apache.org/sbnn-error
> {code}
> where *a.b.c.d* is the RPC proxy corresponds to the active NN, which confuses
> user to think failover is not behaving correctly. Because *a.b.c.d* is
> expected to be the proxy corresponding to the standby NN here instead.
> The reason is that when the ProxyDescriptor data field of
> RetryInvocationHandler may be shared by multiple threads that do the RPC
> calls, the failover done by one thread (which changed the rpc proxy) may be
> visible to other threads when the other threads report the above message.
> An example sequence:
> # multiple threads start with the same SNN to do RPC calls,
> # all threads discover that a failover is needed,
> # thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
> # other threads reports the above message with the proxyInfo changed by
> thread X, and reported ANN instead of SNN in the message.
> Some details:
> RetryInvocationHandler does the following when failing over:
> {code}
> synchronized void failover(long expectedFailoverCount, Method method,
> int callId) {
> // Make sure that concurrent failed invocations only cause a single
> // actual failover.
> if (failoverCount == expectedFailoverCount) {
> fpp.performFailover(proxyInfo.proxy);
> failoverCount++;
> } else {
> LOG.warn("A failover has occurred since the start of call #" + callId
> + " " + proxyInfo.getString(method.getName()));
> }
> proxyInfo = fpp.getProxy();
> }
> {code}
> and changed the proxyInfo in the ProxyDescriptor.
> While the log method below report message with ProxyDescriotor's proxyinfo:
> {code}
> private void log(final Method method, final boolean isFailover,
> final int failovers, final long delay, final Exception ex) {
> ......
> final StringBuilder b = new StringBuilder()
> .append(ex + ", while invoking ")
> .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
> if (failovers > 0) {
> b.append(" after ").append(failovers).append(" failover attempts");
> }
> b.append(isFailover? ". Trying to failover ": ". Retrying ");
> b.append(delay > 0? "after sleeping for " + delay + "ms.":
> "immediately.");
> {code}
> and so does {{handleException}} method do
> {code}
> if (LOG.isDebugEnabled()) {
> LOG.debug("Exception while invoking call #" + callId + " "
> + proxyDescriptor.getProxyInfo().getString(method.getName())
> + ". Not retrying because " + retryInfo.action.reason, e);
> }
> {code}
> FailoverProxyProvider
> {code}
> public String getString(String methodName) {
> return proxy.getClass().getSimpleName() + "." + methodName
> + " over " + proxyInfo;
> }
> @Override
> public String toString() {
> return proxy.getClass().getSimpleName() + " over " + proxyInfo;
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]