[
https://issues.apache.org/jira/browse/HDFS-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ming Ma updated HDFS-7441:
--------------------------
Summary: More accurate detection for slow node in HDFS write pipeline
(was: More accurate slow node detection in HDFS write pipeline)
> More accurate detection for slow node in HDFS write pipeline
> ------------------------------------------------------------
>
> Key: HDFS-7441
> URL: https://issues.apache.org/jira/browse/HDFS-7441
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
>
> A DN can be slow due to OS or hardware issues. The HDFS write pipeline
> sometimes fails to identify the slow DN correctly.
> In the following example, the MR task runs on 1.2.3.4, which is also the
> slow DN that should have been removed from the pipeline. Instead, HDFS
> removed the healthy DN 5.6.7.8. After pipeline recovery, HDFS went on to
> remove the newly added healthy DN 9.10.11.12, and so on.
> DFSClient log on 1.2.3.4
> {noformat}
> 2014-11-19 20:50:22,601 WARN [ResponseProcessor for block
> blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_1157561391_1102030131492
> java.io.IOException: Bad response ERROR for block
> blk_1157561391_1102030131492 from datanode 5.6.7.8:50010
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
> 2014-11-19 20:50:22,977 WARN [DataStreamer for file ... block
> blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010,
> 5.6.7.8:50010: bad datanode 5.6.7.8:50010
> {noformat}
> DN Log on 1.2.3.4
> {noformat}
> 2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> opWriteBlock blk_1157561391_1102030131492 received exception
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
> ...
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> {noformat}
> DN Log on 5.6.7.8
> {noformat}
> 2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Exception for blk_1157561391_1102030131492
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
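> The logs above are consistent with how the client-side ResponseProcessor
> assigns blame from a pipeline ack: each DN in the pipeline reports a
> per-node status, and the client drops the first node whose status is not
> SUCCESS. When the slow first DN stalls, it is the downstream DN whose read
> times out and whose status turns ERROR, so the healthy node gets removed.
> The following is a simplified, hypothetical sketch of that blame logic
> (class and method names are illustrative, not actual HDFS code):
> {noformat}

```java
// Hypothetical sketch (not actual HDFS code) of ack-based blame assignment
// in the write pipeline. The client removes the FIRST node whose ack status
// is not SUCCESS -- which can be a healthy downstream node that merely
// timed out waiting on a slow upstream node.
public class PipelineBlameSketch {
    enum Status { SUCCESS, ERROR }

    // Returns the index of the datanode the client would drop,
    // or -1 if the whole ack is clean.
    static int firstBadNode(Status[] ackStatuses) {
        for (int i = 0; i < ackStatuses.length; i++) {
            if (ackStatuses[i] != Status.SUCCESS) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pipeline: [0] = 1.2.3.4 (slow), [1] = 5.6.7.8 (healthy).
        // 5.6.7.8 timed out reading from 1.2.3.4, so *its* status is
        // ERROR while the slow node still reports SUCCESS for itself.
        Status[] ack = { Status.SUCCESS, Status.ERROR };
        int bad = firstBadNode(ack);
        System.out.println("Node removed: index " + bad
                + " (the healthy downstream DN, not the slow one)");
    }
}
```

> {noformat}
> Under this sketch, more accurate detection would need per-node timing or
> health signals rather than only the position of the first ERROR status.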
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)