Posting the issue on this forum as well.

Regards,
Rajat
---------- Forwarded message ----------
From: Rajat Goel <rajatgoe...@gmail.com>
Date: Wed, Dec 28, 2011 at 12:04 PM
Subject: Re: MapReduce job failing when a node of cluster is rebooted
To: common-u...@hadoop.apache.org

No, it's not connecting; it's out of the cluster. I am testing a node-failure scenario, so I am not bothered about the node going down. The issue here is that the job should succeed on the remaining nodes, since the replication factor is > 1, but the job is failing.
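In case it helps, this is roughly how I am confirming that the remaining datanodes are live and that the data is still available after the reboot (a rough sketch; /data is just an example path from my setup, and I am assuming the default replication factor of 3):

    bin/hadoop dfsadmin -report                        # cluster summary: live datanodes, capacity, per-node status
    bin/hadoop fsck /data -files -blocks -locations    # per-file block and replica locations

With a replication factor greater than 1, I would expect both of these to look fine with a single node down, which is why the job failure surprises me.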
Regards,
Rajat

On Tue, Dec 27, 2011 at 7:25 PM, alo alt <wget.n...@googlemail.com> wrote:
> Is the DN you've just rebooted connecting to the NN? Most likely the
> datanode daemon isn't running; check it:
> ps waux |grep "DataNode" |grep -v "grep"
>
> - Alex
>
> On Tue, Dec 27, 2011 at 2:44 PM, Rajat Goel <rajatgoe...@gmail.com> wrote:
> > Yes. HDFS- and MapRed-related dirs are set outside of /tmp.
> >
> > On Tue, Dec 27, 2011 at 6:48 PM, alo alt <wget.n...@googlemail.com> wrote:
> >> Hi,
> >>
> >> Did you set the HDFS-related dirs outside of /tmp? Most *nix systems
> >> clean them up on reboot.
> >>
> >> - Alex
> >>
> >> On Tue, Dec 27, 2011 at 2:09 PM, Rajat Goel <rajatgoe...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I have a 7-node setup (1 NameNode/JobTracker, 6 DataNodes/TaskTrackers)
> >> > running Hadoop version 0.20.203.
> >> >
> >> > I performed the following test:
> >> > Initially the cluster is running smoothly. Just before launching a MapReduce
> >> > job (about one or two minutes before), I shut down one of the data nodes
> >> > (rebooted the machine). My MapReduce job then starts but immediately fails
> >> > with the following messages on stderr:
> >> >
> >> > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
> >> > use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
> >> > files.
> >> > (the above WARNING is printed four times)
> >> > NOTICE: Configuration: /device.map /region.map /url.map
> >> > /data/output/2011/12/26/08 PS:192.168.100.206:11111 3600 true Notice
> >> > 11/12/26 09:10:26 WARN mapred.JobClient: Use GenericOptionsParser for
> >> > parsing the arguments. Applications should implement Tool for the same.
> >> > 11/12/26 09:10:26 INFO input.FileInputFormat: Total input paths to process : 24
> >> > 11/12/26 09:10:37 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > 11/12/26 09:10:37 INFO hdfs.DFSClient: Abandoning block blk_-6309642664478517067_35619
> >> > 11/12/26 09:10:37 INFO hdfs.DFSClient: Waiting to find target node: 192.168.100.7:50010
> >> > 11/12/26 09:10:44 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.net.NoRouteToHostException: No route to host
> >> > 11/12/26 09:10:44 INFO hdfs.DFSClient: Abandoning block blk_4129088682008611797_35619
> >> > 11/12/26 09:10:53 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > 11/12/26 09:10:53 INFO hdfs.DFSClient: Abandoning block blk_3596375242483863157_35619
> >> > 11/12/26 09:11:01 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > 11/12/26 09:11:01 INFO hdfs.DFSClient: Abandoning block blk_724369205729364853_35619
> >> > 11/12/26 09:11:07 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
> >> >
> >> > 11/12/26 09:11:07 WARN hdfs.DFSClient: Error Recovery for block blk_724369205729364853_35619 bad datanode[1] nodes == null
> >> > 11/12/26 09:11:07 WARN hdfs.DFSClient: Could not get block locations. Source file
> >> > "/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split" - Aborting...
> >> > 11/12/26 09:11:07 INFO mapred.JobClient: Cleaning up the staging area
> >> > hdfs://machine-100-205:9000/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292
> >> > Exception in thread "main" java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
> >> > 11/12/26 09:11:07 ERROR hdfs.DFSClient: Exception closing file
> >> > /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split :
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> > java.io.IOException: Bad connect ack with firstBadLink as 192.168.100.5:50010
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
> >> >
> >> > - In the above logs, 192.168.100.5 is the machine I rebooted.
> >> > - The JobTracker's log file has no entries in the above time period.
> >> > - The NameNode's log file has no exceptions or messages related to the above errors.
> >> > - All nodes can reach each other via IP or hostname.
> >> > - The ulimit value for open files is set to 1024, but I don't see many connections
> >> >   in the CLOSE_WAIT state (I Googled a bit, and some people suggest this value can
> >> >   be a culprit in some cases).
> >> > - My Hadoop configuration files set the number of mappers (8), reducers (4), and
> >> >   io.sort.mb (512 MB); most other parameters are at their default values.
> >> >
> >> > Can someone please provide pointers to a solution for this problem?
> >> >
> >> > Thanks,
> >> > Rajat
> >>
> >> --
> >> Alexander Lorenz
> >> http://mapredit.blogspot.com
> >>
> >> P Think of the environment: please don't print this email unless you
> >> really need to.
> >
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> P Think of the environment: please don't print this email unless you
> really need to.
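P.S. Regarding the ulimit / CLOSE_WAIT observation in my original mail quoted above, this is roughly what I ran on each datanode to check it (a sketch only; the exact counts differ per node):

    ulimit -n                               # open-file limit for the user running the daemons (currently 1024)
    netstat -tan | grep CLOSE_WAIT | wc -l  # number of TCP connections stuck in CLOSE_WAIT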