[ https://issues.apache.org/jira/browse/HDFS-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7441:
--------------------------
    Description: 
A DN could be slow due to OS or HW issues. The HDFS write pipeline sometimes 
fails to detect the slow DN correctly. Detection of a "slow node" might not be 
specific to the HDFS write pipeline. When a node is slow due to an OS/HW issue, 
it is better to exclude it from HDFS reads and writes as well as YARN/MR 
operations. The issue here is that the write operation takes a long time for a 
given block. We need some mechanism to detect such nodes quickly for 
high-throughput applications.
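
One possible shape for such a mechanism, as a minimal sketch only and not something this issue proposes: compare each node's recent write latency against the median latency of its peers and flag clear outliers. The function name, the `factor` and `min_nodes` parameters, and the latency values in the usage note are all hypothetical.

```python
from statistics import median


def find_slow_nodes(latencies_ms, factor=3.0, min_nodes=3):
    """Return nodes whose latency exceeds `factor` times the median of their peers.

    latencies_ms: dict mapping a node address to its recent average
    write latency in milliseconds (how this is measured is left open).
    """
    # With fewer than min_nodes samples there is no meaningful baseline.
    if len(latencies_ms) < min_nodes:
        return []
    slow = []
    for node, latency in latencies_ms.items():
        # Exclude the node itself so a single slow node cannot skew its own baseline.
        peers = [v for n, v in latencies_ms.items() if n != node]
        if latency > factor * median(peers):
            slow.append(node)
    return sorted(slow)
```

For instance, with made-up latencies of 900 ms for 1.2.3.4, 12 ms for 5.6.7.8 and 15 ms for 9.10.11.12, only 1.2.3.4 would be flagged. Excluding each node from its own baseline is a design choice that keeps the check usable in small pipelines.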

In the following example, an MR task runs on 1.2.3.4, which is the slow DN that 
should have been removed. Instead, HDFS removed the healthy DN 5.6.7.8 from the 
pipeline. After rebuilding the pipeline, HDFS went on to remove the newly added 
healthy DN 9.10.11.12, and so on.

DFSClient log on 1.2.3.4
{noformat}
2014-11-19 20:50:22,601 WARN [ResponseProcessor for block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_1157561391_1102030131492
java.io.IOException: Bad response ERROR for block blk_1157561391_1102030131492 from datanode 5.6.7.8:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
2014-11-19 20:50:22,977 WARN [DataStreamer for file ...  block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 5.6.7.8:50010: bad datanode 5.6.7.8:50010
{noformat}

DN Log on 1.2.3.4
{noformat}
2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock blk_1157561391_1102030131492 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
...
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
{noformat}


DN Log on 5.6.7.8
{noformat}
2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for blk_1157561391_1102030131492
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
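
In both DN logs above, the read that timed out has its remote end on 1.2.3.4. A rough way to surface that pattern from collected DN logs is to count, per remote host, the distinct connections on which a read timeout was reported. The following is a sketch under assumptions: the function name is hypothetical, and the regular expression assumes the exact message format shown in these excerpts.

```python
import re
from collections import Counter

# Matches the channel description in a read-timeout message, e.g.
# "... ready for read. ch : java.nio.channels.SocketChannel[connected
#  local=/1.2.3.4:50010 remote=/1.2.3.4:32844]"
READ_TIMEOUT_RE = re.compile(
    r"ready for read\. ch : [\w.]+\[connected "
    r"local=/([\d.]+:\d+) remote=/([\d.]+):(\d+)\]"
)


def upstream_timeout_counts(log_text):
    """Count distinct timed-out read connections per remote (upstream) host."""
    # Collapse whitespace so log entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    # A set removes repeats of the same connection (message plus stack trace header).
    connections = {m.groups() for m in READ_TIMEOUT_RE.finditer(flat)}
    return Counter(remote_ip for _local, remote_ip, _port in connections)


SAMPLE_LOG = """
opWriteBlock blk_1157561391_1102030131492 received exception
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
Exception for blk_1157561391_1102030131492
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
"""

print(upstream_timeout_counts(SAMPLE_LOG))  # Counter({'1.2.3.4': 2})
```

A host that appears as the remote end of timed-out reads on several DNs is a candidate for the slow node, although a count like this is only a hint and not proof.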

  was:
A DN could be slow due to OS or HW issues. HDFS write pipeline sometimes 
couldn't detect the slow DN correctly.

In the following example, MR task runs on 1.2.3.4. 1.2.3.4 is the slow DN that 
should have been removed. But HDFS took out the healthy DN 5.6.7.8. With the 
new pipeline, HDFS continued to take out the newly added healthy DN 9.10.11.12, 
etc. 

DFSClient log on 1.2.3.4
{noformat}
2014-11-19 20:50:22,601 WARN [ResponseProcessor for block 
blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream 
ResponseProcessor exception  for block blk_1157561391_1102030131492
java.io.IOException: Bad response ERROR for block blk_1157561391_1102030131492 
from datanode 5.6.7.8:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
2014-11-19 20:50:22,977 WARN [DataStreamer for file ...  block 
blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 5.6.7.8:50010: bad 
datanode 5.6.7.8:50010
{noformat}

DN Log on 1.2.3.4
{noformat}
2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock blk_1157561391_1102030131492 received exception 
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
...
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
{noformat}


DN Log on 5.6.7.8
{noformat}
2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for blk_1157561391_1102030131492
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)
{noformat}


> More accurate detection for slow node in HDFS write pipeline
> ------------------------------------------------------------
>
>                 Key: HDFS-7441
>                 URL: https://issues.apache.org/jira/browse/HDFS-7441
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>
> A DN could be slow due to OS or HW issues. The HDFS write pipeline sometimes 
> fails to detect the slow DN correctly. Detection of a "slow node" might not 
> be specific to the HDFS write pipeline. When a node is slow due to an OS/HW 
> issue, it is better to exclude it from HDFS reads and writes as well as 
> YARN/MR operations. The issue here is that the write operation takes a long 
> time for a given block. We need some mechanism to detect such nodes quickly 
> for high-throughput applications.
> In the following example, an MR task runs on 1.2.3.4, which is the slow DN 
> that should have been removed. Instead, HDFS removed the healthy DN 5.6.7.8 
> from the pipeline. After rebuilding the pipeline, HDFS went on to remove the 
> newly added healthy DN 9.10.11.12, and so on.
> DFSClient log on 1.2.3.4
> {noformat}
> 2014-11-19 20:50:22,601 WARN [ResponseProcessor for block 
> blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: 
> DFSOutputStream ResponseProcessor exception  for block 
> blk_1157561391_1102030131492
> java.io.IOException: Bad response ERROR for block 
> blk_1157561391_1102030131492 from datanode 5.6.7.8:50010 at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
> 2014-11-19 20:50:22,977 WARN [DataStreamer for file ...  block 
> blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error 
> Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 
> 5.6.7.8:50010: bad datanode 5.6.7.8:50010
> {noformat}
> DN Log on 1.2.3.4
> {noformat}
> 2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> opWriteBlock blk_1157561391_1102030131492 received exception 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
> ...
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> {noformat}
> DN Log on 5.6.7.8
> {noformat}
> 2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for blk_1157561391_1102030131492
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
