[ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226339#comment-16226339 ]
He Xiaoqiao commented on HDFS-12749:
------------------------------------
[~tanyuxin] Could you attach more troubleshooting details or exception logs?
> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: TanYuxin
>
> Our cluster has thousands of DNs and millions of files and blocks. When the NN
> restarts, its load is very high.
> After the SNN restarts, each DN calls BPServiceActor#reRegister to register.
> But the register RPC gets an IOException because the NN is busy processing
> block reports. The exception is caught in BPServiceActor#processCommand.
> The caught IOException is shown below:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/IP:Port remote=NameNode/IP:Port]; Host Details : local host is: "DataNode/Datanode_ip"; destination host is: "NameNode_Host":Port;
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>     at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> If BPServiceActor#register throws an IOException, scheduler.scheduleBlockReport
> is never called, so the block report is not sent immediately.
> However, the NN has already received the register RPC and registered the DN
> successfully, so it will not ask the DN to register again at the next
> heartbeat. As a result, the DN never sends its block report after registering.
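
To make the sequence in the description easier to follow, here is a minimal, self-contained sketch of the control flow as I understand it from the report. It is not Hadoop source code: FakeNameNode, FakeActor and their fields are hypothetical stand-ins, and the blockReportScheduled flag only models the call to scheduler.scheduleBlockReport. The assumption is that the NN completes the registration but its reply arrives after the DN's RPC timeout.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical, simplified model of the reported behaviour; not Hadoop code.
class ReRegisterSketch {

    /** Stand-in for the NameNode side of the register RPC. */
    static class FakeNameNode {
        boolean datanodeRegistered = false;
        boolean replyTimesOut = true; // assumption: NN is too busy to reply in time

        void registerDatanode() throws IOException {
            // The NN handles the request and records the DN ...
            datanodeRegistered = true;
            // ... but the DN gives up waiting before the reply arrives.
            if (replyTimesOut) {
                throw new IOException("Failed on local exception",
                        new SocketTimeoutException("60000 millis timeout"));
            }
        }
    }

    /** Stand-in for the DataNode-side actor. */
    static class FakeActor {
        final FakeNameNode nn;
        boolean blockReportScheduled = false;

        FakeActor(FakeNameNode nn) {
            this.nn = nn;
        }

        void register() throws IOException {
            nn.registerDatanode();       // throws when the reply times out
            blockReportScheduled = true; // models scheduler.scheduleBlockReport; skipped on exception
        }

        void processCommand() {
            try {
                register();
            } catch (IOException e) {
                // The exception is only logged here; nothing retries the registration.
                System.out.println("WARN Error processing datanode Command: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        FakeActor actor = new FakeActor(nn);
        actor.processCommand();
        System.out.println("NN considers DN registered: " + nn.datanodeRegistered);
        System.out.println("DN scheduled block report:  " + actor.blockReportScheduled);
    }
}
```

In this model the two sides end up disagreeing: the NN believes the DN is registered, while the DN has not scheduled a block report, and nothing in the sketch triggers a retry.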