microle.dong created HDFS-17092:
-----------------------------------

             Summary: Datanode Full Block Report failed can lead to missing and under replicated blocks
                 Key: HDFS-17092
                 URL: https://issues.apache.org/jira/browse/HDFS-17092
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: microle.dong
When restarting the namenode, we found that some datanodes did not report all of their blocks, which can lead to missing and under-replicated blocks. In the logs of a datanode with an incomplete block report, I found that the first FBR attempt failed due to a namenode error:

{code:java}
2023-07-14 11:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x7b738b02996cd2, containing 12 storage report(s), of which we sent 1. The reports had 633033 total blocks and used 1 RPC(s). This took 169 msec to generate and 97730 msecs for RPC and NN processing. Got back no commands.
2023-07-14 11:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868 remote=x.x.x.x/x.x.x.x:9002]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
	at org.apache.hadoop.ipc.Client.call(Client.java:1480)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
	at java.lang.Thread.run(Thread.java:745)
{code}

The datanode's second FBR then reuses the same lease, which makes the namenode remove the datanode's lease (just as in HDFS-8930), so the retried FBR fails because no lease is left. We should request a new lease and try again when a datanode FBR fails. I am willing to submit a PR to fix this.
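For illustration, a minimal sketch of the proposed retry behaviour is below. It does not use the real BPServiceActor/DatanodeProtocol wiring; NameNodeClient and its two methods are hypothetical stand-ins (in the actual datanode the lease id is handed out via the heartbeat response), so this only outlines the idea of discarding the old lease and obtaining a new one before retrying:

{code:java}
import java.io.IOException;

/**
 * Self-contained sketch of the proposed fix, NOT the actual BPServiceActor
 * code. NameNodeClient and its methods are hypothetical stand-ins for the
 * DatanodeProtocol calls the real datanode uses.
 */
public class FullBlockReportRetrySketch {

  /** Hypothetical view of the NameNode RPC surface needed for an FBR. */
  interface NameNodeClient {
    /** Ask the NameNode for a fresh full-block-report lease id. */
    long requestFullBlockReportLease() throws IOException;
    /** Send the full block report under the given lease id. */
    void sendFullBlockReport(long leaseId) throws IOException;
  }

  private final NameNodeClient nn;

  FullBlockReportRetrySketch(NameNodeClient nn) {
    this.nn = nn;
  }

  /**
   * Send the FBR; on failure, discard the current lease id and obtain a new
   * one before retrying, instead of reusing a lease the NameNode may already
   * have removed (the situation described in this issue).
   */
  void reportWithRetry(int maxAttempts) throws IOException {
    long leaseId = nn.requestFullBlockReportLease();
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        nn.sendFullBlockReport(leaseId);
        return; // report accepted, done
      } catch (IOException e) {
        if (attempt == maxAttempts) {
          throw e; // give up after the last attempt
        }
        // Key point of the proposal: do NOT reuse the old lease id.
        // The NameNode may have removed it after the failed/partial report,
        // so ask for a new lease before the next attempt.
        leaseId = nn.requestFullBlockReportLease();
      }
    }
  }
}
{code}

An alternative with the same effect might be to simply schedule a new FBR after the failure and let the normal heartbeat path hand out a fresh lease, rather than retrying inline.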