Posting the issue on this forum as well.

Regards,
Rajat
---------- Forwarded message ----------
From: Rajat Goel <rajatgoe...@gmail.com>
Date: Wed, Dec 28, 2011 at 12:04 PM
Subject: Re: MapReduce job failing when a node of cluster is rebooted
To: common-u...@hadoop.apache.org

No, it's not connecting; it's out of the cluster. I am testing a node-failure scenario, so I am not bothered about the node going down. The issue here is that the job should succeed on the remaining nodes, since the replication factor is > 1, but the job is failing.
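In case it helps, this is roughly how I am confirming that the remaining datanodes are live and that the data is still available after the reboot (a rough sketch; /data is just an example path from my setup, and I am assuming the default replication factor of 3):

    bin/hadoop dfsadmin -report                        # cluster summary: live datanodes, capacity, per-node status
    bin/hadoop fsck /data -files -blocks -locations    # per-file block and replica locations

With a replication factor greater than 1, I would expect both of these to look fine with a single node down, which is why the job failure surprises me.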
Regards,
Rajat

On Tue, Dec 27, 2011 at 7:25 PM, alo alt <wget.n...@googlemail.com> wrote:
> Is the DN you've just rebooted connecting to the NN? Most likely the
> datanode daemon isn't running; check it:
> ps waux |grep "DataNode" |grep -v "grep"
>
> - Alex
>
> On Tue, Dec 27, 2011 at 2:44 PM, Rajat Goel <rajatgoe...@gmail.com> wrote:
> > Yes. HDFS- and MapRed-related dirs are set outside of /tmp.
> >
> > On Tue, Dec 27, 2011 at 6:48 PM, alo alt <wget.n...@googlemail.com> wrote:
> >> Hi,
> >>
> >> Did you set the HDFS-related dirs outside of /tmp? Most *nix systems
> >> clean them up on reboot.
> >>
> >> - Alex
> >>
> >> On Tue, Dec 27, 2011 at 2:09 PM, Rajat Goel <rajatgoe...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I have a 7-node setup (1 NameNode/JobTracker, 6 DataNodes/TaskTrackers)
> >> > running Hadoop version 0.20.203.
> >> >
> >> > I performed the following test:
> >> > Initially the cluster is running smoothly. Just before launching a MapReduce
> >> > job (about one or two minutes before), I shut down one of the data nodes
> >> > (rebooted the machine). My MapReduce job then starts but immediately fails
> >> > with the following messages on stderr:
> >> >
> >> > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
> >> > use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
> >> > files.
> >> > (the above WARNING is printed four times)
> >> > NOTICE: Configuration: /device.map /region.map /url.map
> >> > /data/output/2011/12/26/08 PS:192.168.100.206:11111 3600 true Notice
> >> > 11/12/26 09:10:26 WARN mapred.JobClient: Use GenericOptionsParser for
> >> > parsing the arguments. Applications should implement Tool for the same.
> >> > 11/12/26 09:10:26 INFO input.FileInputFormat: Total input paths to process : 24
> >> > 11/12/26 09:10:37 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > 11/12/26 09:10:37 INFO hdfs.DFSClient: Abandoning block blk_-6309642664478517067_35619
> >> > 11/12/26 09:10:37 INFO hdfs.DFSClient: Waiting to find target node: 192.168.100.7:50010
> >> > 11/12/26 09:10:44 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.net.NoRouteToHostException: No route to host
> >> > 11/12/26 09:10:44 INFO hdfs.DFSClient: Abandoning block blk_4129088682008611797_35619
> >> > 11/12/26 09:10:53 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > 11/12/26 09:10:53 INFO hdfs.DFSClient: Abandoning block blk_3596375242483863157_35619
> >> > 11/12/26 09:11:01 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > 11/12/26 09:11:01 INFO hdfs.DFSClient: Abandoning block blk_724369205729364853_35619
> >> > 11/12/26 09:11:07 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
> >> >
> >> > 11/12/26 09:11:07 WARN hdfs.DFSClient: Error Recovery for block blk_724369205729364853_35619 bad datanode[1] nodes == null
> >> > 11/12/26 09:11:07 WARN hdfs.DFSClient: Could not get block locations. Source file
> >> > "/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split" - Aborting...
> >> > 11/12/26 09:11:07 INFO mapred.JobClient: Cleaning up the staging area
> >> > hdfs://machine-100-205:9000/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292
> >> > Exception in thread "main" java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
> >> > 11/12/26 09:11:07 ERROR hdfs.DFSClient: Exception closing file
> >> > /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split :
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
> >> >
> >> > - In the above logs, 192.168.100.5 is the machine I rebooted.
> >> > - The JobTracker's log file has no entries in the above time period.
> >> > - The NameNode's log file has no exceptions or messages related to the above errors.
> >> > - All nodes can reach each other via IP or hostname.
> >> > - The ulimit value for open files is set to 1024, but I don't see many connections
> >> >   in the CLOSE_WAIT state (I Googled a bit, and some people suggest this value can
> >> >   be a culprit in some cases).
> >> > - My Hadoop configuration files set the number of mappers (8), reducers (4), and
> >> >   io.sort.mb (512 MB); most other parameters are at their default values.
> >> >
> >> > Can someone please provide pointers to a solution for this problem?
> >> >
> >> > Thanks,
> >> > Rajat
> >>
> >> --
> >> Alexander Lorenz
> >> http://mapredit.blogspot.com
> >>
> >> P Think of the environment: please don't print this email unless you
> >> really need to.
> >
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> P Think of the environment: please don't print this email unless you
> really need to.
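P.S. Regarding the ulimit / CLOSE_WAIT observation in my original mail quoted above, this is roughly what I ran on each datanode to check it (a sketch only; the exact counts differ per node):

    ulimit -n                               # open-file limit for the user running the daemons (currently 1024)
    netstat -tan | grep CLOSE_WAIT | wc -l  # number of TCP connections stuck in CLOSE_WAIT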