[ https://issues.apache.org/jira/browse/HDFS-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7441:
--------------------------
    Summary: More accurate detection for slow node in HDFS write pipeline  
(was: More accurate slow node detection in HDFS write pipeline)

> More accurate detection for slow node in HDFS write pipeline
> ------------------------------------------------------------
>
>                 Key: HDFS-7441
>                 URL: https://issues.apache.org/jira/browse/HDFS-7441
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>
> A DN can be slow because of OS or hardware issues, and the HDFS write 
> pipeline sometimes fails to identify the slow DN correctly.
> In the following example, the MR task runs on 1.2.3.4, which is also the slow 
> DN that should have been removed. Instead, HDFS removed the healthy DN 
> 5.6.7.8. After rebuilding the pipeline, HDFS went on to remove the newly added 
> healthy DN 9.10.11.12, and so on.
> DFSClient log on 1.2.3.4
> {noformat}
> 2014-11-19 20:50:22,601 WARN [ResponseProcessor for block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_1157561391_1102030131492
> java.io.IOException: Bad response ERROR for block blk_1157561391_1102030131492 from datanode 5.6.7.8:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
> 2014-11-19 20:50:22,977 WARN [DataStreamer for file ...  block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 5.6.7.8:50010: bad datanode 5.6.7.8:50010
> {noformat}
> DN Log on 1.2.3.4
> {noformat}
> 2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock blk_1157561391_1102030131492 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
> ...
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> {noformat}
> DN Log on 5.6.7.8
> {noformat}
> 2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for blk_1157561391_1102030131492
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
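
One way to read the two DN logs above: both DataNodes timed out while reading from a peer on 1.2.3.4, which points at 1.2.3.4 rather than 5.6.7.8. The sketch below illustrates that reasoning as a simple vote over read-timeout observations. It is not the fix proposed in this issue and not HDFS code; the class SlowNodeVote, the type ReadTimeout, and the method suspect are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Hadoop.
class SlowNodeVote {

    /** One observation: `reporter` hit a read timeout while waiting on `upstream`. */
    static final class ReadTimeout {
        final String reporter;
        final String upstream;

        ReadTimeout(String reporter, String upstream) {
            this.reporter = reporter;
            this.upstream = upstream;
        }
    }

    /**
     * Returns the host named most often as the stalled upstream peer,
     * or empty if there are no observations or the top count is tied.
     */
    static Optional<String> suspect(List<ReadTimeout> timeouts) {
        // Each timeout counts as one vote against the peer being waited on.
        Map<String, Integer> votes = new HashMap<>();
        for (ReadTimeout t : timeouts) {
            votes.merge(t.upstream, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        boolean tied = false;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            int count = e.getValue();
            if (count > bestCount) {
                best = e.getKey();
                bestCount = count;
                tied = false;
            } else if (count == bestCount) {
                tied = true; // no unique winner so far
            }
        }
        return tied ? Optional.empty() : Optional.ofNullable(best);
    }
}
```

Fed the two timeouts from the logs (the DN on 1.2.3.4 waiting on 1.2.3.4, and the DN on 5.6.7.8 waiting on 1.2.3.4), suspect returns 1.2.3.4. A real detector would need more signal than this, since a read timeout alone cannot distinguish a slow sender from a slow network path.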



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
