[
https://issues.apache.org/jira/browse/HADOOP-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongjun Zhang updated HADOOP-14312:
-----------------------------------
Attachment: HADOOP-14312.001.patch
> RetryInvocationHandler may report ANN as SNN in messages.
> ----------------------------------------------------------
>
> Key: HADOOP-14312
> URL: https://issues.apache.org/jira/browse/HADOOP-14312
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Attachments: HADOOP-14312.001.patch
>
>
> When multiple threads use the same DFSClient to make RPC calls, they may
> report incorrect NN host name in messages like
> INFO [pool-3-thread-13] retry.RetryInvocationHandler
> (RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete
> of class ClientNamenodeProtocolTranslatorPB over
> hdpb-nn0001.prn.parsec.apple.com/*a.b.c.d*:8020. Trying to fail over
> immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby. Visit
> https://s.apache.org/sbnn-error
> where *a.b.c.d* is the active NN, which confuses user to think failover is
> not behaving correctly.
> The reason is that the ProxyDescriptor data field of RetryInvocationHandler
> may be shared by multiple threads that do the RPC calls, the failover done by
> one thread may be visible to other threads when reporting the above kind of
> message.
> As an example,
> # multiple threads start with the same SNN to do RPC calls,
> # all threads discover that a failover is needed,
> # thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
> # other threads reports the above message with the proxyInfo changed by
> thread X, and reported ANN instead of SNN in the message.
> Some details:
> RetryInvocationHandler does the following when failing over:
> {code}
> synchronized void failover(long expectedFailoverCount, Method method,
> int callId) {
> // Make sure that concurrent failed invocations only cause a single
> // actual failover.
> if (failoverCount == expectedFailoverCount) {
> fpp.performFailover(proxyInfo.proxy);
> failoverCount++;
> } else {
> LOG.warn("A failover has occurred since the start of call #" + callId
> + " " + proxyInfo.getString(method.getName()));
> }
> proxyInfo = fpp.getProxy();
> }
> {code}
> and changed the proxyInfo in the ProxyDescriptor.
> While the log method below report message with ProxyDescriotor's proxyinfo:
> {code}
> private void log(final Method method, final boolean isFailover,
> final int failovers, final long delay, final Exception ex) {
> ......
> final StringBuilder b = new StringBuilder()
> .append(ex + ", while invoking ")
> .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
> if (failovers > 0) {
> b.append(" after ").append(failovers).append(" failover attempts");
> }
> b.append(isFailover? ". Trying to failover ": ". Retrying ");
> b.append(delay > 0? "after sleeping for " + delay + "ms.":
> "immediately.");
> {code}
> and so does {{handleException}} method do
> {code}
> if (LOG.isDebugEnabled()) {
> LOG.debug("Exception while invoking call #" + callId + " "
> + proxyDescriptor.getProxyInfo().getString(method.getName())
> + ". Not retrying because " + retryInfo.action.reason, e);
> }
> {code}
> FailoverProxyProvider
> {code}
> public String getString(String methodName) {
> return proxy.getClass().getSimpleName() + "." + methodName
> + " over " + proxyInfo;
> }
> @Override
> public String toString() {
> return proxy.getClass().getSimpleName() + " over " + proxyInfo;
> }
> {code}
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]