[jira] [Updated] (HDFS-7009) Active NN and standby NN have different live nodes

Ming Ma (JIRA) Fri, 26 Sep 2014 23:01:05 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ming Ma updated HDFS-7009:
--------------------------
    Attachment: HDFS-7009.patch

> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
>                 Key: HDFS-7009
>                 URL: https://issues.apache.org/jira/browse/HDFS-7009
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478, in most 
> cases, given DN sends HB and BR to NN regularly, if a specific RPC call 
> fails, it isn't a big deal.
> However, there are cases where DN fails to register with NN during initial 
> handshake due to exceptions not covered by RPC client's connection retry. 
> When this happens, the DN won't talk to that NN until the DN restarts.
> {noformat}
> BPServiceActor
>   public void run() {
>     LOG.info(this + " starting to offer service");
>     try {
>       // init stuff
>       try {
>         // setup storage
>         connectToNNAndHandshake();
>       } catch (IOException ioe) {
>         // Initial handshake, storage recovery or registration failed
>         // End BPOfferService thread
>         LOG.fatal("Initialization failed for block pool " + this, ioe);
>         return;
>       }
>       initialized = true; // bp is initialized;
>       
>       while (shouldRun()) {
>         try {
>           offerService();
>         } catch (Exception ex) {
>           LOG.error("Exception in BPOfferService for " + this, ex);
>           sleepAndLogInterrupts(5000, "offering service");
>         }
>       }
> ...
> {noformat}
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response 
> is null.; Host Details : local host is: "xxx"; destination host is: 
> "yyy":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This will create discrepancy between active NN and standby NN in terms of 
> live nodes.
>  
> Here is a possible scenario of missing blocks after failover.
> 1. DN A, B set up handshakes with active NN, but not with standby NN.
> 2. A block is replicated to DN A, B and C.
> 3. From standby NN's point of view, given A and B are dead nodes, the block 
> is under replicated.
> 4. DN C is down.
> 5. Before active NN detects DN C is down, it fails over.
> 6. The new active NN considers the block is missing. Even though there are 
> two replicas on DN A and B.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HDFS-7009) Active NN and standby NN have different live nodes

Reply via email to