I have seen similar error logs in the Hadoop JIRA (HADOOP-2691, HDFS-795), but I'm
not sure this one is exactly the same scenario.
Hadoop version: 0.19.2
The client-side DFSClient fails to write when a few of the DataNodes in the grid go
down. I see this error:
***************************
2009-11-13 13:45:27,815 WARN DFSClient | DFSOutputStream ResponseProcessor exception for block blk_3028932254678171367_1462691 java.io.IOException: Bad response 1 for block blk_3028932254678171367_1462691 from datanode 10.201.9.225:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2341)
2009-11-13 13:45:27,815 WARN DFSClient | Error Recovery for block blk_3028932254678171367_1462691 bad datanode[2] 10.201.9.225:50010
2009-11-13 13:45:27,815 WARN DFSClient | Error Recovery for block blk_3028932254678171367_1462691 in pipeline 10.201.9.218:50010, 10.201.9.220:50010, 10.201.9.225:50010: bad datanode 10.201.9.225:50010
2009-11-13 13:45:37,433 WARN DFSClient | DFSOutputStream ResponseProcessor exception for block blk_-6619123912237837733_1462799 java.io.IOException: Bad response 1 for block blk_-6619123912237837733_1462799 from datanode 10.201.9.225:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2341)
2009-11-13 13:45:37,433 WARN DFSClient | Error Recovery for block blk_-6619123912237837733_1462799 bad datanode[1] 10.201.9.225:50010
2009-11-13 13:45:37,433 WARN DFSClient | Error Recovery for block blk_-6619123912237837733_1462799 in pipeline 10.201.9.218:50010, 10.201.9.225:50010: bad datanode 10.201.9.225:50010
***************************
The only way I could get my client program to write to DFS again was to restart it.
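For reference, the restart I'm doing is effectively this kind of retry loop around the whole open/write/close attempt. This is a generic sketch with no Hadoop-specific APIs; retryWrite is a hypothetical helper, not part of DFSClient:

```java
import java.util.concurrent.Callable;

// Minimal retry helper: re-runs the whole write attempt (open stream,
// write, close) a few times before giving up, mimicking a manual
// client restart after a pipeline failure.
public class WriteRetry {
    public static <T> T retryWrite(int maxAttempts, long backoffMillis,
                                   Callable<T> attempt) throws Exception {
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e;
                // Linear backoff before the next attempt; tune for your setup.
                Thread.sleep(backoffMillis * i);
            }
        }
        throw last; // all attempts failed
    }
}
```

In my case the Callable would wrap the FileSystem open/write/close sequence, so each retry starts a fresh pipeline instead of reusing the broken one.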
Any suggestions on how to get around this problem on the client side? My
understanding was that the DFSClient APIs take care of situations like this, so
clients don't need to worry about some of the DataNodes going down.
Also, the replication factor in my setup is 3, and there are 10 DataNodes (of which
TWO went down).
Thanks!
Arvind