[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364731#comment-16364731
 ] 

Kihwal Lee commented on HDFS-12749:
-----------------------------------

I see. The old registration looks identical to the new one, so NN still accepts 
it.  About catching IOException: if the registration fails with a 
RemoteException that is not RetriableException, the actor may need to stop 
instead of retrying. Also, if we choose to blank out some field before 
re-registering in order to re-trigger registration, we should avoid hitting 
something like HDFS-8995.
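
To make the retry-or-stop distinction concrete, here is a minimal, self-contained sketch. It is not the patch and does not use Hadoop classes: StubRemoteException, StubRetriableException, RegisterCall, Outcome and the maxAttempts bound are all hypothetical stand-ins invented for illustration. The only idea it shows is that a local or transport IOException (such as the wrapped SocketTimeoutException in the trace) is retried, while a server-side failure that is not marked retriable stops the loop.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class RegisterRetrySketch {
  /** Hypothetical stand-in for a failure reported by the server over RPC. */
  static class StubRemoteException extends IOException {
    StubRemoteException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for a server-side failure that is safe to retry. */
  static class StubRetriableException extends StubRemoteException {
    StubRetriableException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for the registration RPC. */
  interface RegisterCall {
    void call() throws IOException;
  }

  enum Outcome { REGISTERED, STOPPED, GAVE_UP }

  /** Retry local/transport failures; retry remote failures only if retriable. */
  static boolean shouldRetry(IOException e) {
    if (e instanceof StubRemoteException) {
      return e instanceof StubRetriableException;
    }
    return true;
  }

  /** maxAttempts is an assumption that keeps the sketch bounded and testable. */
  static Outcome registerWithRetry(RegisterCall rpc, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        rpc.call();
        return Outcome.REGISTERED;
      } catch (IOException e) {
        if (!shouldRetry(e)) {
          return Outcome.STOPPED; // non-retriable remote failure: stop the actor
        }
        // A real loop would sleep here before the next attempt.
      }
    }
    return Outcome.GAVE_UP;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Busy server: the first two calls time out, the third succeeds.
    Outcome busy = registerWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new IOException("Failed on local exception",
            new SocketTimeoutException("timeout"));
      }
    }, 5);
    System.out.println(busy + " after " + calls[0] + " calls");

    // Server rejects the registration outright: no retry.
    Outcome rejected = registerWithRetry(() -> {
      throw new StubRemoteException("registration rejected");
    }, 5);
    System.out.println(rejected);
  }
}
```

In the real code the classification would presumably have to inspect what the RemoteException wraps rather than rely on subclassing as this sketch does.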

> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: TanYuxin
>            Priority: Major
>         Attachments: HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, its load is very high.
> After the NN restarts, each DN calls BPServiceActor#reRegister to register. 
> But the register RPC gets an IOException because the NN is busy processing 
> block reports. The exception is caught in BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The IOException is not caught inside BPServiceActor#register, so it aborts 
> the method and the block report cannot be sent immediately.
> {code}
>   /**
>    * Register one bp with the corresponding NameNode
>    * <p>
>    * The bpDatanode needs to register with the namenode on startup in order
>    * 1) to report which storage it is serving now and 
>    * 2) to receive a registrationID
>    *  
>    * issued by the namenode to recognize registered datanodes.
>    * 
>    * @param nsInfo current NamespaceInfo
>    * @see FSNamesystem#registerDatanode(DatanodeRegistration)
>    * @throws IOException
>    */
>   void register(NamespaceInfo nsInfo) throws IOException {
>     // The handshake() phase loaded the block pool storage
>     // off disk - so update the bpRegistration object from that info
>     DatanodeRegistration newBpRegistration = bpos.createRegistration();
>     LOG.info(this + " beginning handshake with NN");
>     while (shouldRun()) {
>       try {
>         // Use returned registration from namenode with updated fields
>         newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
>         newBpRegistration.setNamespaceInfo(nsInfo);
>         bpRegistration = newBpRegistration;
>         break;
>       } catch(EOFException e) {  // namenode might have just restarted
>         LOG.info("Problem connecting to server: " + nnAddr + " :"
>             + e.getLocalizedMessage());
>         sleepAndLogInterrupts(1000, "connecting to server");
>       } catch(SocketTimeoutException e) {  // namenode is busy
>         LOG.info("Problem connecting to server: " + nnAddr);
>         sleepAndLogInterrupts(1000, "connecting to server");
>       }
>     }
>     
>     LOG.info("Block pool " + this + " successfully registered with NN");
>     bpos.registrationSucceeded(this, bpRegistration);
>     // random short delay - helps scatter the BR from all DNs
>     scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> However, the NameNode has already processed registerDatanode successfully, so 
> it won't ask the DN to re-register.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
