Ming Ma created HDFS-7441:
-----------------------------
Summary: More accurate slow node detection in HDFS write pipeline
Key: HDFS-7441
URL: https://issues.apache.org/jira/browse/HDFS-7441
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Ming Ma
A DN can be slow due to OS or hardware issues, and the HDFS write pipeline
sometimes fails to identify the slow DN correctly.
In the following example, an MR task runs on 1.2.3.4, which is also the slow DN
that should have been removed from the pipeline. Instead, HDFS removed the
healthy DN 5.6.7.8. After pipeline recovery, HDFS went on to remove the newly
added healthy DN 9.10.11.12, and so on.
DFSClient log on 1.2.3.4
{noformat}
2014-11-19 20:50:22,601 WARN [ResponseProcessor for block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1157561391_1102030131492
java.io.IOException: Bad response ERROR for block blk_1157561391_1102030131492 from datanode 5.6.7.8:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
2014-11-19 20:50:22,977 WARN [DataStreamer for file ... block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 5.6.7.8:50010: bad datanode 5.6.7.8:50010
{noformat}
DN Log on 1.2.3.4
{noformat}
2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock blk_1157561391_1102030131492 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
...
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
{noformat}
DN Log on 5.6.7.8
{noformat}
2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for blk_1157561391_1102030131492
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
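A simplified, hypothetical sketch (not the actual DFSOutputStream code) of why the wrong node gets evicted: when the slow upstream DN stalls forwarding packets, the healthy downstream DN hits a read timeout and is the one whose ack slot reports ERROR, and the client blames the first node in the ack with a non-SUCCESS status.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the client-side eviction choice during
// pipeline recovery: blame the FIRST datanode whose ack status is not SUCCESS.
public class PipelineBlameSketch {

    enum Status { SUCCESS, ERROR }

    // Returns the pipeline index of the datanode the client will evict,
    // or -1 if every node acked SUCCESS.
    static int firstBadNode(List<Status> ackStatuses) {
        for (int i = 0; i < ackStatuses.size(); i++) {
            if (ackStatuses.get(i) != Status.SUCCESS) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pipeline: index 0 = 1.2.3.4 (slow), index 1 = 5.6.7.8 (healthy).
        // The slow node stalls forwarding, so 5.6.7.8 times out waiting for
        // data and its ack slot becomes ERROR, while the slow node's own
        // slot still reads SUCCESS.
        List<Status> ack = Arrays.asList(Status.SUCCESS, Status.ERROR);
        // The healthy downstream node (index 1) gets blamed, not the slow one.
        System.out.println("evicting datanode index " + firstBadNode(ack));
    }
}
```

Under this logic a timeout caused by the upstream node is indistinguishable from a genuine failure of the downstream node, which matches the logs above: 5.6.7.8 is evicted even though its SocketTimeoutException was on the read from 1.2.3.4.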
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)