[jira] [Commented] (HDFS-15837) Incorrect bytes causing block corruption after update pipeline and recovery failure

2021-02-17 Thread Kihwal Lee (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286111#comment-17286111 ]

Kihwal Lee commented on HDFS-15837:
---

The successful pipeline update with 3072 bytes was at 03:42:40. The 
replication of the 3544-byte block and the corruption detection came 4 minutes 
later. The client must have written more data after the recovery and then 
closed the file; replication happens only after the file is closed. It looks 
like the file was closed at around 03:46:40.

I don't think the size difference observed on the datanodes is an anomaly; 
clients usually write more data after a pipeline recovery. Does the file still 
exist? What size do you see when you run "hadoop fs -ls /path_to_the_file"? 
Does the block show up as missing in the UI?
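
Assuming the file still exists, something like this (the path is a placeholder) 
will show the length the NN has recorded and whether any replica is reported 
missing or corrupt:
{noformat}
# Placeholder path: prints the file length the NN recorded, then the
# per-block replica state (missing/corrupt replicas are flagged by fsck).
hadoop fs -ls /path_to_the_file
hdfs fsck /path_to_the_file -files -blocks -locations
{noformat}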

If there was a corruption issue, I don't think it was because of the size. The 
fact that the replication happened means the NN agreed with the size 3544; if 
the datanode had an unexpected replica size, the transfer wouldn't have 
happened. The corruption report came from the destination node for the 
replication, 172.21.234.181. The replication source does not perform any 
checks, and there is no feedback mechanism for reporting an error back to the 
source node, so even if a corruption was detected during the replication, only 
the target (172.21.234.181) would notice it. Check the log of 172.21.234.181 
at around 03:46:43; it must have detected the corruption and logged it.
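
Something along these lines on 172.21.234.181 should find it (the log path 
below is only the usual default and may differ on your setup):
{noformat}
# On 172.21.234.181: look for the corruption report for this block around 03:46.
# The log location is an assumption; point it at wherever the datanode logs live.
grep 'blk_1342440165' /var/log/hadoop-hdfs/*datanode*.log* | grep '2021-01-25 03:46'
{noformat}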

The replica on the source (172.21.226.26) must be the only copy, and if it is 
corrupt, the block will show up as missing. If the file still exists, you can 
get onto that node and run the "hdfs debug verifyMeta" command against the 
block file and the meta file to manually check the integrity.
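
For example, something like the following on 172.21.226.26 (the data directory 
and the <path> bits are placeholders; use your dfs.datanode.data.dir layout):
{noformat}
# Locate the replica and its meta file under the block pool seen in the logs,
# then verify that the checksums in the meta file match the block data.
find /data/*/hdfs/current/BP-908477295-172.21.224.178-1606768078949 -name 'blk_1342440165*'
hdfs debug verifyMeta -meta <path>/blk_1342440165_272630578.meta -block <path>/blk_1342440165
{noformat}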

If it really is corrupt, it could be due to a local issue on 172.21.226.26. 
Extreme load can cause write failures and pipeline updates, but we haven't seen 
such a condition cause data corruption.

[jira] [Commented] (HDFS-15837) Incorrect bytes causing block corruption after update pipeline and recovery failure

2021-02-16 Thread Udit Mehrotra (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285488#comment-17285488 ]

Udit Mehrotra commented on HDFS-15837:
--

[~kihwal] The Hadoop version is *2.8.5*, running on EMR. So it seems the block 
was successfully written with size *3554* and was ultimately reported as a bad 
block. Can you point to some of the probable causes for this that you have 
seen? Could it be due to high network load on that particular data node?

Also, is there something we can do to avoid a recurrence? At this time we are 
investigating a strategy to distribute the network load so the data node does 
not get overwhelmed, but we still want to understand the root cause.

[jira] [Commented] (HDFS-15837) Incorrect bytes causing block corruption after update pipeline and recovery failure

2021-02-16 Thread Kihwal Lee (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285448#comment-17285448 ]

Kihwal Lee commented on HDFS-15837:
---

What is the version of Hadoop you are using?  We've seen similar issues with 
varying causes.  Was this file closed successfully with the size 3072?

> Incorrect bytes causing block corruption after update pipeline and recovery 
> failure
> ---
>
> Key: HDFS-15837
> URL: https://issues.apache.org/jira/browse/HDFS-15837
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.8.5
>Reporter: Udit Mehrotra
>Priority: Major
>
> We are seeing cases of HDFS blocks being marked as *bad* after the initial 
> block receive fails during *update pipeline* and the subsequent *HDFS* 
> *recovery* for the block fails as well. Here is the life cycle of a block 
> *{{blk_1342440165_272630578}}* that was ultimately marked as corrupt:
> 1. The block creation starts at the name node as part of the *update 
> pipeline* process:
> {noformat}
> 2021-01-25 03:41:17,335 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 61 on 
> 8020): updatePipeline(blk_1342440165_272500939 => blk_1342440165_272630578) 
> success{noformat}
> 2. The block receiver on the data node fails with a socket timeout exception, 
> and so do the retries:
> {noformat}
> 2021-01-25 03:42:22,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (PacketResponder: 
> BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
> PacketResponder: 
> BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
> java.net.SocketTimeoutException: 65000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/172.21.226.26:56294 remote=/172.21.246.239:50010]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:400)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1305)
> at java.lang.Thread.run(Thread.java:748)
> 2021-01-25 03:42:22,526 WARN org.apache.hadoop.hdfs.server.datanode.DataNode 
> (PacketResponder: 
> BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
> IOException in BlockReceiver.run(): 
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
> at 
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
> at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> at java.io.DataOutputStream.flush(DataOutputStream.java:123)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
> at java.lang.Thread.run(Thread.java:748)
> 2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (PacketResponder: 
> BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
> PacketResponder: 
>