Hi,

I have a 7-node cluster (1 NameNode/JobTracker, 6 DataNodes/TaskTrackers)
running Hadoop 0.20.203.

I performed the following test:
Initially, the cluster is running smoothly. About one or two minutes before
launching a MapReduce job, I shut down one of the DataNodes (by rebooting
the machine). The job then starts but fails immediately with the following
messages on stderr:

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
NOTICE: Configuration: /device.map    /region.map    /url.map
/data/output/2011/12/26/08
 PS:192.168.100.206:11111    3600    true    Notice
11/12/26 09:10:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/26 09:10:26 INFO input.FileInputFormat: Total input paths to process : 24
11/12/26 09:10:37 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
11/12/26 09:10:37 INFO hdfs.DFSClient: Abandoning block blk_-6309642664478517067_35619
11/12/26 09:10:37 INFO hdfs.DFSClient: Waiting to find target node: 192.168.100.7:50010
11/12/26 09:10:44 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.NoRouteToHostException: No route to host
11/12/26 09:10:44 INFO hdfs.DFSClient: Abandoning block blk_4129088682008611797_35619
11/12/26 09:10:53 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
11/12/26 09:10:53 INFO hdfs.DFSClient: Abandoning block blk_3596375242483863157_35619
11/12/26 09:11:01 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
11/12/26 09:11:01 INFO hdfs.DFSClient: Abandoning block blk_724369205729364853_35619
11/12/26 09:11:07 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)

11/12/26 09:11:07 WARN hdfs.DFSClient: Error Recovery for block blk_724369205729364853_35619 bad datanode[1] nodes == null
11/12/26 09:11:07 WARN hdfs.DFSClient: Could not get block locations. Source file "/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split" - Aborting...
11/12/26 09:11:07 INFO mapred.JobClient: Cleaning up the staging area hdfs://machine-100-205:9000/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292
Exception in thread "main" java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
11/12/26 09:11:07 ERROR hdfs.DFSClient: Exception closing file /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split : java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
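
One way to read the four "Abandoning block" lines above is as the client's block-write retry loop running out of attempts while writing job.split. The sketch below is a back-of-the-envelope model, not DFSClient's actual code. It assumes stock 0.20 defaults of 3 for dfs.client.block.write.retries (one initial attempt plus three retries) and 10 for mapred.submit.replication (the replication assumed for job.split); the class and method names are hypothetical. Under those assumptions, on a 6-DataNode cluster every job.split pipeline must contain all six nodes, so a rebooted node that the NameNode still lists as live cannot be avoided on any retry:

```java
// Back-of-the-envelope model of the failed job.split write.
// All names are hypothetical; the defaults below are ASSUMPTIONS,
// not values read from this cluster's configuration.
final class PipelineRetrySketch {
    static final int ASSUMED_SUBMIT_REPLICATION = 10; // assumed mapred.submit.replication
    static final int ASSUMED_WRITE_RETRIES = 3;       // assumed dfs.client.block.write.retries

    // Pipeline length for one block: the requested replication, capped at
    // the number of DataNodes the NameNode currently believes are live.
    static int pipelineLength(int replication, int liveInNameNodeView) {
        return Math.min(replication, liveInNameNodeView);
    }

    // When the pipeline must contain every node the NameNode lists as live,
    // a rebooted-but-not-yet-expired node is in every pipeline it hands out.
    static boolean deadNodeUnavoidable(int replication, int liveInNameNodeView) {
        return pipelineLength(replication, liveInNameNodeView) == liveInNameNodeView;
    }

    // One initial attempt plus the configured retries; in the log above,
    // each failed attempt produced one "Abandoning block" line.
    static int totalAttempts(int writeRetries) {
        return 1 + writeRetries;
    }

    public static void main(String[] args) {
        int liveInView = 6; // all six DataNodes, including the rebooted one
        System.out.println(pipelineLength(ASSUMED_SUBMIT_REPLICATION, liveInView));      // 6
        System.out.println(deadNodeUnavoidable(ASSUMED_SUBMIT_REPLICATION, liveInView)); // true
        System.out.println(deadNodeUnavoidable(3, liveInView));                          // false
        System.out.println(totalAttempts(ASSUMED_WRITE_RETRIES));                        // 4
    }
}
```

With a replication of 3 the rebooted node can still be picked for a pipeline, but a retry has a chance of getting one without it; with a replication at or above the cluster size it never does.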


- In the above logs, 192.168.100.5 is the machine I rebooted.
- The JobTracker's log file has no entries for the above time period.
- The NameNode's log file has no exceptions or other messages related to
the above errors.
- All nodes can reach each other by IP address or hostname.
- The ulimit for open files is set to 1024, but I don't see many connections
in CLOSE_WAIT state. (I Googled a bit, and some people suggest this value
can be the culprit in some cases.)
- My Hadoop configuration files set the number of mappers (8), reducers (4),
and io.sort.mb (512 MB). Most other parameters are left at their default
values.
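
For the timing in the test above, a rough sketch of how long the NameNode keeps treating a silent DataNode as live may help. It assumes the stock 0.20 defaults heartbeat.recheck.interval = 300000 ms and dfs.heartbeat.interval = 3 s, and an expiry rule of 2 * recheck + 10 * heartbeat; the class and method names are illustrative only. Under those assumptions, a DataNode rebooted one or two minutes before the job is still well inside the roughly 10.5-minute window, which would also fit the NameNode log showing nothing unusual:

```java
// Sketch of the NameNode's DataNode-expiry window under ASSUMED 0.20 defaults.
// Class and method names are illustrative, not Hadoop APIs.
final class HeartbeatExpirySketch {
    static final long ASSUMED_RECHECK_MS = 5 * 60 * 1000L; // heartbeat.recheck.interval (ms)
    static final long ASSUMED_HEARTBEAT_S = 3L;            // dfs.heartbeat.interval (s)

    // Time after the last heartbeat before a DataNode is declared dead.
    static long expiryMillis(long recheckMs, long heartbeatSec) {
        return 2 * recheckMs + 10 * 1000 * heartbeatSec;
    }

    // True while the NameNode would still list the node as live.
    static boolean stillConsideredLive(long msSinceLastHeartbeat, long recheckMs, long heartbeatSec) {
        return msSinceLastHeartbeat <= expiryMillis(recheckMs, heartbeatSec);
    }

    public static void main(String[] args) {
        long expiry = expiryMillis(ASSUMED_RECHECK_MS, ASSUMED_HEARTBEAT_S);
        System.out.println(expiry); // 630000 ms, i.e. 10.5 minutes
        // Job submitted about two minutes after the reboot:
        System.out.println(stillConsideredLive(2 * 60 * 1000L, ASSUMED_RECHECK_MS, ASSUMED_HEARTBEAT_S)); // true
    }
}
```

If the actual configuration overrides either interval, the window changes accordingly; the arithmetic only shows why a node that went quiet one or two minutes earlier would not yet be excluded.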

Can anyone provide pointers toward a solution to this problem?

Thanks,
Rajat
