[
https://issues.apache.org/jira/browse/HDFS-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
microle.dong updated HDFS-17092:
--------------------------------
Description:
When restarting the NameNode, we found that some DataNodes did not report all
of their blocks, which can lead to missing and under-replicated blocks.
A DataNode uses multiple RPCs to report its blocks. In the logs of a DataNode
with an incomplete block report, I found that the first FBR attempt failed
because the block report RPC to the NameNode timed out:
{code:java}
2023-07-14 17:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Unsuccessfully sent block report 0x7b738b02996cd2, containing 12 storage
report(s), of which we sent 1. The reports had 633013 total blocks and used 1
RPC(s). This took 234 msec to generate and 98739 msecs for RPC and NN
processing. Got back no commands.
2023-07-14 17:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
IOException in offerService
java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002
failed on socket timeout exception: java.net.SocketTimeoutException: 60000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868
remote=x.x.x.x/x.x.x.x:9002]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
at java.lang.Thread.run(Thread.java:745){code}
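For context, once a DataNode's total block count exceeds
dfs.blockreport.split.threshold it sends one blockReport RPC per storage, and
as far as I can tell all of those per-storage RPCs carry the same full block
report lease. The following is only a rough sketch of that split; the class,
constant, and method names are made up for illustration and are not the real
BPServiceActor code:
{code:java}
// Illustrative only: the class, constant, and method names below are
// assumptions for this sketch, not the real BPServiceActor code.
import java.util.List;

public class FbrSplitSketch {

  /** One block-id list per storage (volume) on the DataNode. */
  record StorageBlocks(String storageId, List<Long> blockIds) {}

  // cf. dfs.blockreport.split.threshold (default 1,000,000)
  static final long SPLIT_THRESHOLD = 1_000_000L;

  /** Stand-in for the real DatanodeProtocol#blockReport RPC. */
  static void sendReportRpc(List<StorageBlocks> reports, long leaseId) {
    System.out.println("blockReport RPC: " + reports.size()
        + " storage(s), lease=" + leaseId);
  }

  static void fullBlockReport(List<StorageBlocks> storages, long leaseId) {
    long totalBlocks =
        storages.stream().mapToLong(s -> s.blockIds().size()).sum();
    if (totalBlocks <= SPLIT_THRESHOLD) {
      // Small report: all storages go out in a single RPC under one lease.
      sendReportRpc(storages, leaseId);
    } else {
      // Large report: one RPC per storage, but every RPC still carries the
      // same lease id. If an early RPC fails and the NameNode later drops
      // the lease, the remaining per-storage RPCs are rejected.
      for (StorageBlocks s : storages) {
        sendReportRpc(List.of(s), leaseId);
      }
    }
  }
}
{code}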
The DataNode's second FBR then reuses the same lease, which makes the NameNode
remove the DataNode's lease (just as in HDFS-8930), so the remaining FBR RPCs
fail because no lease is left.
We should request a new lease and retry when a DataNode FBR fails.
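A rough sketch of the proposed retry is below; it is only an illustration of
the idea, and the helper names (requestNewLease, sendFullBlockReport) are
hypothetical. In real HDFS the lease id is handed out by the NameNode via
heartbeat responses and the report goes out through DatanodeProtocol#blockReport:
{code:java}
// Sketch of the proposed retry only, not the actual patch. The helpers
// requestNewLease() and sendFullBlockReport() are hypothetical; in real
// HDFS the lease id is granted by the NameNode via heartbeat responses and
// the report is sent through DatanodeProtocol#blockReport.
import java.io.IOException;

public class FbrRetrySketch {

  static final int MAX_FBR_ATTEMPTS = 3; // illustrative bound

  /** Stand-in for the (possibly per-storage) block report RPC(s). */
  static void sendFullBlockReport(long leaseId) throws IOException {
    // real code: one or more DatanodeProtocol#blockReport calls
  }

  /** Stand-in for obtaining a fresh lease from the NameNode. */
  static long requestNewLease() {
    return System.nanoTime(); // placeholder lease id
  }

  static void reportWithRetry(long initialLeaseId) {
    long leaseId = initialLeaseId;
    for (int attempt = 1; attempt <= MAX_FBR_ATTEMPTS; attempt++) {
      try {
        sendFullBlockReport(leaseId);
        return; // success
      } catch (IOException e) {
        // Do not reuse the old lease after a failure: the NameNode may have
        // removed it (see HDFS-8930), so any further RPC carrying it would
        // be rejected. Ask for a fresh lease and try again instead.
        leaseId = requestNewLease();
      }
    }
  }
}
{code}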
I am willing to submit a PR to fix this.
was:
When restarting the NameNode, we found that some DataNodes did not report all
of their blocks, which can lead to missing and under-replicated blocks.
In the logs of a DataNode with an incomplete block report, I found that the
first FBR attempt failed because the block report RPC to the NameNode timed
out:
{code:java}
2023-07-14 11:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Unsuccessfully sent block report 0x7b738b02996cd2, containing 12 storage
report(s), of which we sent 1. The reports had 633033 total blocks and used 1
RPC(s). This took 169 msec to generate and 97730 msecs for RPC and NN
processing. Got back no commands.
2023-07-14 11:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
IOException in offerService
java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002
failed on socket timeout exception: java.net.SocketTimeoutException: 60000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868
remote=x.x.x.x/x.x.x.x:9002]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
at java.lang.Thread.run(Thread.java:745){code}
The DataNode's second FBR then reuses the same lease, which makes the NameNode
remove the DataNode's lease (just as in HDFS-8930), so the FBR fails because no
lease is left.
We should request a new lease and retry when a DataNode FBR fails.
I am willing to submit a PR to fix this.
> Datanode Full Block Report failure can lead to missing and under-replicated
> blocks
> ---------------------------------------------------------------------------------
>
> Key: HDFS-17092
> URL: https://issues.apache.org/jira/browse/HDFS-17092
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: microle.dong
> Priority: Major
>
> When restarting the NameNode, we found that some DataNodes did not report all
> of their blocks, which can lead to missing and under-replicated blocks.
> A DataNode uses multiple RPCs to report its blocks. In the logs of a DataNode
> with an incomplete block report, I found that the first FBR attempt failed
> because the block report RPC to the NameNode timed out:
>
> {code:java}
> 2023-07-14 17:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Unsuccessfully sent block report 0x7b738b02996cd2, containing 12 storage
> report(s), of which we sent 1. The reports had 633013 total blocks and used 1
> RPC(s). This took 234 msec to generate and 98739 msecs for RPC and NN
> processing. Got back no commands.
> 2023-07-14 17:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> IOException in offerService
> java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002
> failed on socket timeout exception: java.net.SocketTimeoutException: 60000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868
> remote=x.x.x.x/x.x.x.x:9002]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
> at org.apache.hadoop.ipc.Client.call(Client.java:1480)
> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
> at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
> at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
> at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
> at java.lang.Thread.run(Thread.java:745){code}
> The DataNode's second FBR then reuses the same lease, which makes the
> NameNode remove the DataNode's lease (just as in HDFS-8930), so the remaining
> FBR RPCs fail because no lease is left.
> We should request a new lease and retry when a DataNode FBR fails.
> I am willing to submit a PR to fix this.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]