[ https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723497#comment-14723497 ]

Kihwal Lee commented on HDFS-8995:
----------------------------------

{noformat}
2018-09-26 12:15:08,497 WARN datanode.DataNode: Block pool BP-xxx (Datanode Uuid xxx) service to the-namenode.elephantland.gov/10.2.3.4:8020 is shutting down
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnregisteredNodeException): Data node DatanodeRegistration(0.0.0.0, datanodeUuid=xxx, infoPort=, infoSecurePort=, ipcPort=, storageInfo=lv=-56;cid=CID-1xxx;c=xxx) is attempting to report storage ID abc. Node 10.100.100.100:100 (actual ip addr) is expected to serve this storage.
        at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getDatanode(DatanodeManager.java:483)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:3094)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processIncrementalBlockReport(FSNamesystem.java:6406)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReceivedAndDeleted(NameNodeRpcServer.java:1200)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolServerSideTranslatorPB.java:215)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26632)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)

        at org.apache.hadoop.ipc.Client.call(Client.java:1451)
        at org.apache.hadoop.ipc.Client.call(Client.java:1382)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy14.blockReceivedAndDeleted(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:240)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reportReceivedDeletedBlocks(BPServiceActor.java:289)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:692)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:834)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

The namenode is saying that the incremental block report came with a DN registration 
containing the address 0.0.0.0. That is what the DN sends during registration, not in 
block reports. During registration, the NN replies with a registration containing the 
address it actually saw, and the DN uses that NN-issued registration from that point 
on, so subsequent calls carry the address of its external interface. This exception 
trace therefore suggests a bug in exception handling and re-registration.
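
To make the bookkeeping concrete, here is a minimal, self-contained sketch (plain Java 16+, 
not the real Hadoop classes; Registration, FakeNameNode, and the extra observedIp parameter 
are all illustrative stand-ins) of the handshake described above: the DN registers with 
0.0.0.0, the NN hands back a registration carrying the address it actually saw, and the DN 
must keep using that NN-issued registration for later calls such as blockReceivedAndDeleted. 
If an error path falls back to the original local registration instead of re-registering, 
the NN rejects the incremental block report, which is the failure mode in the trace.

{code:java}
import java.util.Objects;

public final class RegistrationSketch {

    /** Stand-in for DatanodeRegistration: only the fields this sketch keys on. */
    record Registration(String ipAddr, String datanodeUuid) {}

    /** Stand-in for the NameNode side of the datanode protocol. */
    static final class FakeNameNode {
        private Registration known; // what the NN recorded at registration time

        /**
         * The NN rewrites the address to the one it observed and remembers it.
         * (observedIp stands in for the address the real NN derives from the
         * RPC connection rather than from a parameter.)
         */
        Registration registerDatanode(Registration reg, String observedIp) {
            known = new Registration(observedIp, reg.datanodeUuid());
            return known;
        }

        /** Incremental block report: rejected unless the registration is the known one. */
        void blockReceivedAndDeleted(Registration reg) {
            if (!Objects.equals(reg, known)) {
                throw new IllegalStateException(
                        "unregistered node: " + reg + ", expected " + known);
            }
        }
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();

        // Before registering, the DN only knows a wildcard address.
        Registration local = new Registration("0.0.0.0", "dn-uuid-1");

        // Correct bookkeeping: keep the registration the NN handed back
        // (it carries the external address the NN saw, e.g. 10.2.3.5).
        Registration current = nn.registerDatanode(local, "10.2.3.5");
        nn.blockReceivedAndDeleted(current);
        System.out.println("report with NN-issued registration: accepted");

        // Buggy bookkeeping, as the trace suggests: after an error the DN ends up
        // sending its local 0.0.0.0 registration again instead of re-registering.
        current = local;
        try {
            nn.blockReceivedAndDeleted(current);
        } catch (IllegalStateException e) {
            System.out.println("report with stale registration: rejected (" + e.getMessage() + ")");
        }
    }
}
{code}

Running the sketch prints one accepted report and one rejection; in the real datanode the 
rejection surfaces as the UnregisteredNodeException above and the block pool service shuts down.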

> Flaw in registration bookkeeping can make DN die on reconnect
> ------------------------------------------------------------
>
>                 Key: HDFS-8995
>                 URL: https://issues.apache.org/jira/browse/HDFS-8995
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> Normally, datanodes re-register with the namenode after it has been unreachable for 
> longer than the heartbeat expiration and then becomes reachable again. Datanodes 
> keep retrying the last RPC call, such as an incremental block report or a 
> heartbeat, and when it finally gets through, the namenode tells them to 
> re-register.
> We have observed that some datanodes stay dead in such scenarios. Further 
> investigation has revealed that those datanodes were told to shut down by the namenode.


