[
https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Udit Mehrotra updated HDFS-15837:
---------------------------------
Description:
We are seeing cases of HDFS blocks being marked as *bad* after the initial
block receive fails during *update pipeline*, with the subsequent *HDFS recovery*
for the block failing as well. Here is the life cycle of a block
*{{blk_1342440165_272630578}}* that was ultimately marked as corrupt:
1. The block creation starts at the name node as part of the *update pipeline* process:
{noformat}
2021-01-25 03:41:17,335 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 61 on
8020): updatePipeline(blk_1342440165_272500939 => blk_1342440165_272630578)
success{noformat}
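For context, the update pipeline keeps the block ID and bumps only the generation stamp. A minimal sketch of that bookkeeping, with illustrative names rather than Hadoop's actual classes (the real flow is the client driving {{FSNamesystem#updatePipeline}}):
{code:java}
// Minimal sketch of step 1's bookkeeping; illustrative names, not Hadoop's
// classes. "blk_1342440165_272500939 => blk_1342440165_272630578" means:
// same block ID (1342440165), new generation stamp (272630578).
class BlockIdModel {
    final long blockId;
    long generationStamp;

    BlockIdModel(long blockId, long generationStamp) {
        this.blockId = blockId;
        this.generationStamp = generationStamp;
    }

    // updatePipeline keeps the ID and bumps the stamp, so replicas left over
    // from the failed pipeline can be told apart from recovered ones.
    void updatePipeline(long newGenerationStamp) {
        if (newGenerationStamp <= generationStamp) {
            throw new IllegalArgumentException("generation stamp must increase");
        }
        generationStamp = newGenerationStamp;
    }
}
{code}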
2. The block receiver on the data node fails with a socket timeout exception,
as do the retries:
{noformat}
2021-01-25 03:42:22,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]):
PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/172.21.226.26:56294 remote=/172.21.246.239:50010]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:400)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1305)
at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:22,526 WARN org.apache.hadoop.hdfs.server.datanode.DataNode
(PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]):
IOException in BlockReceiver.run():
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]):
PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
at java.lang.Thread.run(Thread.java:748){noformat}
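The 65000 ms timeout above is the upstream data node waiting for a pipeline ack from the downstream node 172.21.246.239:50010. A self-contained JDK sketch of the same failure mode, against a peer that accepts the connection but never acks (plain sockets here; Hadoop wraps this pattern in {{SocketIOWithTimeout}}):
{code:java}
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Demonstrates the step 2 failure mode: a read with a 65 s budget against a
// peer that never answers. Runs for the full 65 seconds before failing.
public class AckReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket silentPeer = new ServerSocket(0); // never writes back
             Socket s = new Socket()) {
            s.connect(new InetSocketAddress("127.0.0.1", silentPeer.getLocalPort()), 1000);
            s.setSoTimeout(65_000); // the 65000 millis budget from the log
            InputStream in = s.getInputStream();
            in.read();              // the PacketResponder blocks here waiting for an ack
        } catch (SocketTimeoutException e) {
            System.err.println("ack read timed out: " + e.getMessage());
        }
    }
}
{code}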
3. The data node terminates the block receive and initiates recovery:
{noformat}
2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]):
PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
terminating
2021-01-25 03:42:22,529 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:60430 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Receiving
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 src:
/172.21.240.31:60430 dest: /172.21.226.26:50010
2021-01-25 03:42:22,529 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:60430 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Recover
RBW replica
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578{noformat}
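The "Recover RBW replica" step reconciles the replica's acknowledged length with what is actually on disk. A hedged model of the invariant this relies on ({{RbwReplicaModel}} is hypothetical; the real checks live in the datanode's {{FsDatasetImpl}}):
{code:java}
// Hedged model, not FsDatasetImpl itself: RBW recovery assumes the replica
// holds at least the bytes acknowledged to the client, and discards any
// unacknowledged tail. Step 5 below shows that assumption not holding.
class RbwReplicaModel {
    long bytesAcked;  // acknowledged to the writer (3072 in this report)
    long bytesOnDisk; // actually persisted        (2678 in this report)

    RbwReplicaModel(long bytesAcked, long bytesOnDisk) {
        this.bytesAcked = bytesAcked;
        this.bytesOnDisk = bytesOnDisk;
    }

    void recover() {
        if (bytesOnDisk < bytesAcked) {
            // Acked bytes are missing from disk; truncating "back" to
            // bytesAcked cannot restore them.
            throw new IllegalStateException(
                "bytesOnDisk=" + bytesOnDisk + " < bytesAcked=" + bytesAcked);
        }
        bytesOnDisk = bytesAcked; // normal case: drop the unacked tail
    }

    public static void main(String[] args) {
        new RbwReplicaModel(3072, 2678).recover(); // throws: acked bytes missing
    }
}
{code}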
4. The recovery on the datanode for *{{blk_1342440165_272630578}}* fails with:
{noformat}
2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:45918 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): Exception
for BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578
java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:45918 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]):
opWriteBlock BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578
received exception java.nio.channels.ClosedByInterruptException
2021-01-25 03:42:34,223 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:45918 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]):
ip-172-21-226-26.ec2.internal:50010:DataXceiver error processing WRITE_BLOCK
operation src: /172.21.240.31:45918 dst: /172.21.226.26:50010
java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
at java.lang.Thread.run(Thread.java:748){noformat}
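{{ClosedByInterruptException}} is plain NIO behavior: an operation on an interruptible channel by a thread whose interrupt flag is set closes the channel and throws. A self-contained reproduction of what the CRC channel seek ran into:
{code:java}
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Pure-JDK reproduction of the step 4 failure: a FileChannel call made by an
// interrupted thread fails with ClosedByInterruptException and closes the
// channel, abandoning the write mid-flight.
public class InterruptedCrcSeekDemo {
    public static void main(String[] args) throws Exception {
        Path meta = Files.createTempFile("blk_meta", ".crc");
        try (FileChannel ch = FileChannel.open(meta, StandardOpenOption.WRITE)) {
            Thread.currentThread().interrupt(); // recovery interrupts the xceiver
            ch.position(0); // the same kind of seek as the CRC channel adjustment
        } catch (ClosedByInterruptException e) {
            System.err.println("CRC channel closed by interrupt; write aborted");
        } finally {
            Thread.interrupted(); // clear the flag so cleanup can do I/O
            Files.deleteIfExists(meta);
        }
    }
}
{code}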
5. This is followed by the data node resetting the bytes to {{3072}}, the
correct length, which is expected behavior per
https://issues.apache.org/jira/browse/hdfs-11472:
{noformat}
2021-01-25 03:42:40,007 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:60430 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): At
ip-172-21-226-26.ec2.internal:50010, Recovering ReplicaBeingWritten,
blk_1342440165_272630578, RBW
getNumBytes() = 3072
getBytesOnDisk() = 2678
getVisibleLength()= 3072
getVolume() = /mnt/hdfs/current
getBlockFile() =
/mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
bytesAcked=3072
bytesOnDisk=2678
2021-01-25 03:42:40,011 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client
DFSClient_attempt_16089610225137_1287341_r_000139_0_392781198_30 at
/172.21.244.117:34354 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220]): Receiving
BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220 src:
/172.21.244.117:34354 dest: /172.21.226.26:50010
2021-01-25 03:42:40,013 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at
/172.21.240.31:60430 [Receiving block
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Resetting
bytesOnDisk to match blockDataLength (=3072) for replica ReplicaBeingWritten,
blk_1342440165_272630578, RBW
getNumBytes() = 3072
getBytesOnDisk() = 2678
getVisibleLength()= 3072
getVolume() = /mnt/hdfs/current
getBlockFile() =
/mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
bytesAcked=3072
bytesOnDisk=2678{noformat}
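A hedged sketch of the adjustment seen above, with illustrative names; the key point is that the in-memory counter is overwritten with the block file's length, not the other way around:
{code:java}
// Illustrative model of "Resetting bytesOnDisk to match blockDataLength":
// after an interrupted write, HDFS-11472 trusts the block file's actual
// length over the in-memory counter. Names here are hypothetical.
class BytesOnDiskResetModel {
    static long reset(long bytesOnDisk, long blockDataLength) {
        if (bytesOnDisk != blockDataLength) {
            System.out.printf("Resetting bytesOnDisk (%d) to blockDataLength (%d)%n",
                    bytesOnDisk, blockDataLength);
            return blockDataLength;
        }
        return bytesOnDisk;
    }

    public static void main(String[] args) {
        // Values from this report: the counter said 2678, the block file was 3072 bytes.
        long adjusted = reset(2678, 3072);
        System.out.println("bytesOnDisk after reset: " + adjusted); // 3072
    }
}
{code}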
6. Since recovery failed, the Name Node restarts {{updatePipeline}} for the
block with an increased generation stamp:
{noformat}
2021-01-25 03:42:40,034 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on
8020): updatePipeline(blk_1342440165_272630578, newGS=272645069,
newLength=3072, newNodes=[172.21.226.26:50010],
client=DFSClient_NONMAPREDUCE_390070548_42)
2021-01-25 03:42:40,034 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on
8020): updatePipeline(blk_1342440165_272630578 => blk_1342440165_272645069)
success{noformat}
Notice that the correct *{{newLength}}*, i.e. *{{3072}}*, is sent in the above
log from the Name Node.
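A minimal model of what the Name Node records at this point (names are illustrative): from here on, the block is expected to finalize at exactly {{newLength}} bytes.
{code:java}
// Hedged model of the step 6 bookkeeping: the committed length becomes the
// yardstick the finalized replica is later measured against.
class CommittedBlockModel {
    final long blockId = 1342440165L;
    long generationStamp = 272630578L;
    long committedLength = -1;

    void updatePipeline(long newGS, long newLength) {
        generationStamp = newGS;     // 272645069
        committedLength = newLength; // 3072, taken from bytesAcked on the DN
    }
}
{code}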
7. The data node then correctly receives the new generation of this block:
{noformat}
2021-01-25 03:46:40,066 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace (PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069,
type=LAST_IN_PIPELINE): src: /172.21.240.31:60430, dest: /172.21.226.26:50010,
bytes: 3554, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_390070548_42,
offset: 0, srvID: 1a2ff556-73f5-4e10-b018-a0c2105b242e, blockid:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, duration:
240052941459
2021-01-25 03:46:40,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069,
type=LAST_IN_PIPELINE): PacketResponder:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069,
type=LAST_IN_PIPELINE terminating{noformat}
But notice here that *bytes: 3554* differs from what was sent from the Name
Node, i.e. *3072*.
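In other words, the finalized replica is 482 bytes longer than the length committed at {{updatePipeline}}. Spelled out as a small, purely illustrative model:
{code:java}
// The step 7 discrepancy: the replica finalized 482 bytes past the
// committed length recorded in step 6, while the committed length stayed at 3072.
public class LengthMismatch {
    public static void main(String[] args) {
        long committedLength = 3072; // newLength from updatePipeline (step 6)
        long finalizedBytes  = 3554; // "bytes: 3554" in the clienttrace above
        System.out.println("extra bytes: " + (finalizedBytes - committedLength)); // 482
    }
}
{code}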
8. The data node then successfully replicates this block downstream to another
node:
{noformat}
2021-01-25 03:46:43,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(BP-908477295-172.21.224.178-1606768078949 heartbeating to
/172.21.238.225:8020): DatanodeRegistration(172.21.226.26:50010,
datanodeUuid=1a2ff556-73f5-4e10-b018-a0c2105b242e, infoPort=50075,
infoSecurePort=0, ipcPort=50020,
storageInfo=lv=-57;cid=CID-7608840c-3634-4377-9ba0-1c710e6d08d5;nsid=2011779031;c=1606768078949)
Starting thread to transfer
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 to
172.21.234.181:50010
2021-01-25 03:46:43,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@5cb29d58):
DataTransfer, at ip-172-21-226-26.ec2.internal:50010: Transmitted
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069
(numBytes=3554) to /172.21.234.181:5001{noformat}
9. However, immediately after the successful transmission and replication, the
name node declares the block corrupt:
{noformat}
2021-01-25 03:46:43,106 INFO org.apache.hadoop.hdfs.StateChange (IPC Server
handler 46 on 8020): *DIR* reportBadBlocks for block:
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 on datanode:
172.21.226.26:50010{noformat}
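A hedged sketch of the check that presumably fires here (the real logic sits in the Name Node's block management; names below are illustrative): a finalized replica whose length disagrees with the committed block length is treated as corrupt, even when it contains more data, which is how a fully transferred replica ends up in {{reportBadBlocks}}.
{code:java}
// Illustrative model of why step 9 follows step 8: length disagreement alone
// is enough to mark a replica corrupt; having *more* bytes does not help.
class CorruptReplicaCheck {
    static boolean isCorrupt(long committedLength, long reportedLength) {
        return reportedLength != committedLength;
    }

    public static void main(String[] args) {
        // 3072 committed at updatePipeline vs 3554 reported by the datanode.
        if (isCorrupt(3072, 3554)) {
            System.out.println("reportBadBlocks blk_1342440165_272645069");
        }
    }
}
{code}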
> Incorrect bytes causing block corruption after update pipeline and recovery
> failure
> -----------------------------------------------------------------------------------
>
> Key: HDFS-15837
> URL: https://issues.apache.org/jira/browse/HDFS-15837
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs
> Affects Versions: 2.8.5
> Reporter: Udit Mehrotra
> Priority: Major
>