[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244890#comment-15244890
 ] 

Konstantin Shvachko commented on HDFS-10301:
--------------------------------------------

My DN has the following six storages:
{code}
DS-019298c0-aab9-45b4-8b62-95d6809380ff:NORMAL:kkk.sss.22.105
DS-0ea95238-d9ba-4f62-ae18-fdb9333465ce:NORMAL:kkk.sss.22.105
DS-191fc04b-90be-42c9-b6fb-fdd1517bf4c7:NORMAL:kkk.sss.22.105
DS-4a2e91c7-cdf0-408b-83a6-286c3534d673:NORMAL:kkk.sss.22.105
DS-5b2941f7-2b52-45a8-b135-dcbe488cc65b:NORMAL:kkk.sss.22.105
DS-6849f605-fd83-462d-97c3-cb6949383f7e:NORMAL:kkk.sss.22.105
{code}
Here are the logs for its block reports. All throw the same exception, but I 
pasted it only once.
{code}
2016-04-12 22:31:58,931 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x283d25423fb64d,  containing 6 storage 
report(s), of which we sent 0. The reports had 81565 total blocks and used 0 
RPC(s). This took 19 msec to generate and 60078 msecs for RPC and NN 
processing. Got back no commands.
2016-04-12 22:31:58,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in offerService
java.net.SocketTimeoutException: Call From 
dn-hcl1264.my.cluster.com/kkk.sss.22.105 to namenode-ha1.my.cluster.com:9000 
failed on socket timeout exception: java.net.SocketTimeoutException: 60000 
millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/kkk.sss.22.105:10101 
remote=namenode-ha1.my.cluster.com/10.150.1.56:9000]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:750)
        at org.apache.hadoop.ipc.Client.call(Client.java:1473)
        at org.apache.hadoop.ipc.Client.call(Client.java:1400)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy12.blockReport(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:178)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:494)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:732)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:872)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/kkk.sss.22.105:10101 
remote=namenode-ha1.my.cluster.com/10.150.1.56:9000]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)

2016-04-12 22:32:59,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x283d334a100bde,  containing 6 storage 
report(s), of which we sent 0. The reports had 81565 total blocks and used 0 
RPC(s). This took 17 msec to generate and 60066 msecs for RPC and NN 
processing. Got back no commands.
2016-04-12 22:33:59,311 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x283d414ae386b2,  containing 6 storage 
report(s), of which we sent 0. The reports had 81565 total blocks and used 0 
RPC(s). This took 16 msec to generate and 60055 msecs for RPC and NN 
processing. Got back no commands.
2016-04-12 22:34:59,409 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x283d4f4a605732,  containing 6 storage 
report(s), of which we sent 0. The reports had 81565 total blocks and used 0 
RPC(s). This took 16 msec to generate and 60032 msecs for RPC and NN 
processing. Got back no commands.
2016-04-12 22:35:59,585 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x283d5d4ca9bf5c,  containing 6 storage 
report(s), of which we sent 0. The reports had 81565 total blocks and used 0 
RPC(s). This took 15 msec to generate and 60040 msecs for RPC and NN 
processing. Got back no commands.
2016-04-12 22:36:47,307 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Successfully sent block report 0x283d6b4ac1b50a,  containing 6 storage 
report(s), of which we sent 6. The reports had 81565 total blocks and used 1 
RPC(s). This took 17 msec to generate and 47664 msecs for RPC and NN 
processing. Got back one command: FinalizeCommand/5.
{code}

I'll attach the NameNode logs for processing these six block reports. Each 
color represents a single report. You can see how the colors interleave, with 
the zombie storage messages in the middle.
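For anyone tracing the failure mode, here is a minimal sketch of the 
zombie-storage logic as I understand it. This is not the actual BlockManager 
code; the class and method names are made up for illustration. Each storage is 
stamped with the id of the last block report that touched it, and after a 
report is processed any storage carrying a different id is treated as a 
zombie. When two reports from the same DataNode interleave, as in the attached 
logs, each report's pruning pass sees the storages stamped by the other report 
and removes their replicas.
{code}
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of zombie-storage detection: each storage remembers the id of the
 * last block report that touched it, and after a report is processed every
 * storage whose remembered id differs from the current report id is treated
 * as a zombie. Names are illustrative only, not the real HDFS classes.
 */
public class ZombieStorageSketch {
  static class Storage {
    final String id;
    long lastBlockReportId;            // id of the last report that updated this storage
    Storage(String id) { this.id = id; }
  }

  static final List<Storage> storages = new ArrayList<>();

  // Process one storage report belonging to block report 'reportId'.
  static void processStorageReport(Storage s, long reportId) {
    s.lastBlockReportId = reportId;    // stamp the storage with this report's id
  }

  // After the last storage of a report, prune "zombies":
  // storages not stamped with the current report id.
  static void removeZombieStorages(long reportId) {
    for (Storage s : storages) {
      if (s.lastBlockReportId != reportId) {
        System.out.println("ZOMBIE: removing replicas of " + s.id
            + " (stamped 0x" + Long.toHexString(s.lastBlockReportId)
            + ", current 0x" + Long.toHexString(reportId) + ")");
      }
    }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 6; i++) {
      storages.add(new Storage("DS-" + i));
    }
    long firstReport  = 0x283d25423fb64dL;   // the DN's first (timed-out) report
    long secondReport = 0x283d334a100bdeL;   // the retransmitted report

    // Interleaved processing, as seen in the NN log: some storages end up
    // stamped by the first report, some by the retransmission.
    processStorageReport(storages.get(0), firstReport);
    processStorageReport(storages.get(1), secondReport);
    processStorageReport(storages.get(2), firstReport);
    processStorageReport(storages.get(3), secondReport);
    processStorageReport(storages.get(4), firstReport);
    processStorageReport(storages.get(5), secondReport);

    // Each report's pruning pass now flags the storages stamped by the other
    // report as zombies, so replicas on healthy storages get removed.
    removeZombieStorages(firstReport);
    removeZombieStorages(secondReport);
  }
}
{code}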

> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>
> When the NameNode is busy, a DataNode can time out sending a block report and 
> then sends the block report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This confuses the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.


