[ https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated HDFS-7009:
--------------------------
Attachment: HDFS-7009.patch
> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
> Key: HDFS-7009
> URL: https://issues.apache.org/jira/browse/HDFS-7009
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478: in most
> cases, because the DN sends heartbeats (HB) and block reports (BR) to the NN
> regularly, a single failed RPC call isn't a big deal.
> However, there are cases where the DN fails to register with the NN during
> the initial handshake because of exceptions that the RPC client's connection
> retry does not cover. When this happens, the DN won't talk to that NN until
> the DN restarts.
> {noformat}
> BPServiceActor
> public void run() {
>   LOG.info(this + " starting to offer service");
>   try {
>     // init stuff
>     try {
>       // setup storage
>       connectToNNAndHandshake();
>     } catch (IOException ioe) {
>       // Initial handshake, storage recovery or registration failed
>       // End BPOfferService thread
>       LOG.fatal("Initialization failed for block pool " + this, ioe);
>       return;
>     }
>     initialized = true; // bp is initialized;
>
>     while (shouldRun()) {
>       try {
>         offerService();
>       } catch (Exception ex) {
>         LOG.error("Exception in BPOfferService for " + this, ex);
>         sleepAndLogInterrupts(5000, "offering service");
>       }
>     }
>     ...
> {noformat}
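
The excerpt above returns from run() on the first IOException thrown by connectToNNAndHandshake(), whereas failures inside offerService() are logged and retried after a sleep. One possible direction, sketched below, is to give the handshake the same treatment. This is only an illustration under stated assumptions, not the attached patch: the Handshake interface, HandshakeRetrySketch and handshakeWithRetry are hypothetical names that do not exist in Hadoop, and the attempt limit and sleep interval are arbitrary.

```java
import java.io.IOException;

/** Hypothetical stand-in for the handshake step; not a Hadoop interface. */
interface Handshake {
    void connectAndRegister() throws IOException;
}

final class HandshakeRetrySketch {
    /**
     * Retries the handshake until it succeeds or maxAttempts is used up.
     * Returns the number of attempts taken; rethrows the last IOException
     * if every attempt failed.
     */
    static int handshakeWithRetry(Handshake h, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                h.connectAndRegister();
                return attempt;
            } catch (IOException ioe) {
                // Remember the failure and back off instead of ending the thread.
                last = ioe;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated NN that fails the first two registration attempts.
        Handshake flaky = () -> {
            if (++calls[0] < 3) {
                throw new IOException("Response is null.");
            }
        };
        int used = handshakeWithRetry(flaky, 5, 10L);
        System.out.println("registered after " + used + " attempts");
    }
}
```

Whether a bounded or unbounded retry is appropriate, and which exceptions should still be treated as fatal (for example storage recovery failures), is a design question this sketch does not try to answer.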
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This creates a discrepancy between the active NN and the standby NN in terms
> of live nodes.
>
> Here is a possible scenario that leads to missing blocks after failover.
> 1. DNs A and B complete the handshake with the active NN, but not with the
> standby NN.
> 2. A block is replicated to DNs A, B and C.
> 3. From the standby NN's point of view, A and B are dead nodes, so the block
> is under-replicated.
> 4. DN C goes down.
> 5. Before the active NN detects that DN C is down, it fails over.
> 6. The new active NN considers the block missing, even though there are
> two replicas on DNs A and B.
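
The scenario above can be walked through with a toy model in which each NN only counts replicas on DNs it believes are live. This is a simplified sketch, not NameNode code: LiveNodeViewSketch and liveReplicas are hypothetical names, and it assumes the standby NN has stopped counting DN C by the time it becomes active.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class LiveNodeViewSketch {
    /** Counts replicas of a block that sit on DNs this NN considers live. */
    static int liveReplicas(Set<String> liveNodes, Set<String> replicaHolders) {
        int count = 0;
        for (String dn : replicaHolders) {
            if (liveNodes.contains(dn)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Step 2: the block has replicas on DNs A, B and C.
        Set<String> replicas = new HashSet<>(Arrays.asList("A", "B", "C"));

        // Step 1: A and B registered only with the active NN.
        Set<String> activeView = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> standbyView = new HashSet<>(Arrays.asList("C"));

        // Step 3: the standby NN sees a single replica (under-replicated).
        System.out.println("active sees " + liveReplicas(activeView, replicas));
        System.out.println("standby sees " + liveReplicas(standbyView, replicas));

        // Steps 4-6: DN C goes down and the standby NN takes over.
        standbyView.remove("C");
        int afterFailover = liveReplicas(standbyView, replicas);
        System.out.println("new active sees " + afterFailover
                + (afterFailover == 0 ? " (block reported missing)" : ""));
    }
}
```

In this model the new active NN counts zero live replicas although DNs A and B still hold the block, which is the discrepancy the description warns about.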