[jira] [Created] (HDFS-6721) Handle the situation where SBN is in zombie state

Ming Ma (JIRA) Mon, 21 Jul 2014 22:59:19 -0700

Ming Ma created HDFS-6721:
-----------------------------

             Summary: Handle the situation where SBN is in zombie state
                 Key: HDFS-6721
                 URL: https://issues.apache.org/jira/browse/HDFS-6721
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma



Issue:

In an HA setup, when the first NN in the service list is the SBN, the RPC 
client always tries the first NN, gets a StandbyException, and then fails over 
to the second NN in the service list, which is the active NN.
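
The ordered failover described above can be sketched roughly as follows. This is only an illustration, not the actual Hadoop RPC client code: OrderedFailoverSketch, StandbyRejection and callInOrder are hypothetical names, and StandbyRejection merely stands in for the StandbyException that the SBN returns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; none of these names are Hadoop APIs.
final class OrderedFailoverSketch {

    /** Stand-in for the StandbyException returned by an SBN. */
    static final class StandbyRejection extends RuntimeException {
        StandbyRejection(String nn) {
            super(nn + " is in standby state");
        }
    }

    /** Tries each NN in service-list order and returns the first successful result. */
    static <T> T callInOrder(List<String> serviceList, Function<String, T> rpc) {
        StandbyRejection last = null;
        for (String nn : serviceList) {
            try {
                return rpc.apply(nn);
            } catch (StandbyRejection e) {
                // A healthy SBN answers quickly, so moving on to the next NN is cheap.
                last = e;
            }
        }
        throw new IllegalStateException("no active NN in service list", last);
    }

    public static void main(String[] args) {
        List<String> serviceList = Arrays.asList("nn1", "nn2");
        // Simulated RPC: nn1 is the SBN, nn2 is the active NN.
        String result = callInOrder(serviceList, nn -> {
            if (nn.equals("nn1")) {
                throw new StandbyRejection(nn);
            }
            return "served by " + nn;
        });
        System.out.println(result);
    }
}
```

The point of the sketch is that the cost of starting with the SBN depends entirely on how quickly the SBN rejects or fails the call.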

This works well when the SBN is healthy. It also works well when the SBN isn't 
running, for example during a rolling upgrade, in which case the client gets 
"java.net.ConnectException: Connection refused" right away.

When the SBN is in a zombie state, for example when the machine is low on 
memory and the SBN process is still running but cannot do much, the client 
gets a ConnectTimeoutException instead.

{noformat}
14/07/21 04:12:42 DEBUG ipc.Client: Connecting to hadoop-foo-nn1/a.b.c.d:8020
14/07/21 04:13:02 DEBUG ipc.Client: closing ipc connection to 
hadoop-foo-nn1/a.b.c.d:8020: 20000 millis timeout while waiting for channel to 
be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending 
remote=hadoop-foo-nn1/a.b.c.d:8020]
{noformat}

When this happens, each RPC client connection wastes 20 seconds (the connect 
timeout in the log above) before failing over. This slows down MR jobs 
significantly.
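
To make the difference between the fast and slow failure modes concrete, here is a minimal sketch using only java.net. It is not how ipc.Client is implemented; ConnectProbeSketch, Outcome, classify and probe are hypothetical names, and the mapping of exception types to outcomes is a simplification.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch; not part of the Hadoop RPC client.
final class ConnectProbeSketch {

    enum Outcome { REACHABLE, REFUSED, TIMED_OUT, OTHER_FAILURE }

    /** Maps a connect failure to the cases described in this issue. */
    static Outcome classify(IOException e) {
        if (e instanceof SocketTimeoutException) {
            // Zombie case: nothing answers until the connect timeout expires.
            return Outcome.TIMED_OUT;
        }
        if (e instanceof ConnectException) {
            // Typically "Connection refused": the process is down, failure is immediate.
            return Outcome.REFUSED;
        }
        return Outcome.OTHER_FAILURE;
    }

    /** Attempts a TCP connect and waits at most timeoutMillis for it to complete. */
    static Outcome probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Outcome.REACHABLE;
        } catch (IOException e) {
            return classify(e);
        }
    }
}
```

A call such as ConnectProbeSketch.probe("hadoop-foo-nn1", 8020, 20000) would block for up to 20 seconds in the TIMED_OUT case, which is the delay each client connection pays before it fails over.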


Solution:
 
Perhaps this is the responsibility of an external monitoring service for HDFS, 
which can detect a machine in a zombie state and restart it.

Can we have HDFS handle this automatically? The state in ZK and on the DNs 
points to the correct active NN. For example, a task JVM can get a hint for 
the active NN from the DN on the local machine.
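
One way the hint idea could look on the client side is sketched below, assuming the client already has a hint naming the active NN. How the hint would be obtained (from ZK or from the local DN) is not specified here, and ActiveHintSketch and preferHinted are hypothetical names, not existing HDFS APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not an existing HDFS API.
final class ActiveHintSketch {

    /**
     * Returns a copy of the service list with the hinted active NN moved to
     * the front. The order is left unchanged if the hint is null or does not
     * name an NN in the list.
     */
    static List<String> preferHinted(List<String> serviceList, String hintedActive) {
        List<String> ordered = new ArrayList<>(serviceList);
        if (hintedActive != null && ordered.remove(hintedActive)) {
            ordered.add(0, hintedActive);
        }
        return ordered;
    }
}
```

With such a reordering the client would try the active NN first and would only reach the SBN, zombie or not, if the hint turned out to be stale.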



--
This message was sent by Atlassian JIRA
(v6.2#6252)
