high rate of task failures because of bad datanodes
---------------------------------------------------
Key: HADOOP-4132
URL: https://issues.apache.org/jira/browse/HADOOP-4132
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.17.1
Reporter: Christian Kunz
With 0.17 we see a high rate of task failures because the same bad datanodes
are reported repeatedly as firstBadLink. We never saw this in 0.16.
After running fewer than 20,000 map tasks, more than 2,500 of them reported
one particular datanode as firstBadLink, with a typical exception of the form:
08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 189000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node:
xxx.yyy.zzz.ttt:50010
08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 189000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node:
xxx.yyy.zzz.ttt:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node:
xxx.yyy.zzz.ttt:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node:
xxx.yyy.zzz.ttt:50010
08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception:
java.io.IOException: Unable to create new block.
08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block
blk_4847638219960634858 bad datanode[2]
Exception in thread "main" java.io.IOException: Could not get block locations.
Aborting...
With several such bad datanodes in a cluster, the probability of jobs failing
increases substantially.
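
To quantify how often each datanode is blamed, the client logs can be tallied
by the address that follows "firstBadLink". The following is only a rough
sketch in Python, not part of Hadoop: the function name count_first_bad_links
and the sample list are made up for illustration, and the regular expression
assumes the message format shown in the excerpt above.

```python
import re
from collections import Counter

# Assumed message format, taken from the log excerpt in this report.
FIRST_BAD_LINK = re.compile(r"Bad connect ack with firstBadLink (\S+)")


def count_first_bad_links(lines):
    """Count firstBadLink reports per datanode address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FIRST_BAD_LINK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the excerpt, used here as sample input.
sample = [
    "08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream",
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
    "08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524",
    "08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream",
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
]

# Print datanodes ordered by how often they were reported.
for node, reports in count_first_bad_links(sample).most_common():
    print(node, reports)
```

Run over the task logs of a whole job, the same tally would show whether the
failures concentrate on a handful of datanodes, as described above.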