All the heartbeat and timeout intervals are configurable, so you don't need to decommission a host explicitly. You can configure both the namenode and the jobtracker to detect a failed host sooner. If you do decommission a host, you will have to explicitly put it back into the cluster later.

Bill
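A hadoop-site.xml sketch of what shortening these detection windows might look like. The dfs.heartbeat.interval and heartbeat.recheck.interval names come from the FSNamesystem code quoted below in this thread; mapred.tasktracker.expiry.interval is not mentioned in the thread and is assumed here as the jobtracker-side counterpart, and the shortened values are arbitrary examples, not recommendations:

    <!-- sketch: dead-datanode expiry = 2 * recheck + 10 * heartbeat (per the code below) -->
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>          <!-- seconds between datanode heartbeats; 3 is the default -->
    </property>
    <property>
      <name>heartbeat.recheck.interval</name>
      <value>30000</value>      <!-- msec; default is 300000 (5 minutes) -->
    </property>
    <!-- assumed jobtracker-side setting, not quoted in this thread -->
    <property>
      <name>mapred.tasktracker.expiry.interval</name>
      <value>60000</value>      <!-- msec before a silent tasktracker is considered lost -->
    </property>

With the example values above, a dead datanode would be noticed after roughly 2 * 30000 + 10 * 3000 = 90000 msec (1.5 minutes) instead of the default 10.5 minutes.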
On Sun, Apr 5, 2009 at 8:52 AM, jason hadoop <[email protected]> wrote:
> From the 0.19.0 FSNamesystem.java, it looks like the timeout by default is
> 2 * 300000 + 10 * 3000 = 630000 msec, or 10 minutes 30 seconds.
> If you have configured dfs.hosts.exclude in your hadoop-site.xml to point to
> an empty file that actually exists, you may add the name (as used in the
> slaves file) for the node to that file and run
>
>     hadoop dfsadmin -refreshNodes
>
> The namenode will decommission that node.
>
>     long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
>     this.heartbeatRecheckInterval = conf.getInt(
>         "heartbeat.recheck.interval", 5 * 60 * 1000); // 5 minutes
>     this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval + 10 *
>         heartbeatInterval;
>
>
> On Sun, Apr 5, 2009 at 2:52 AM, Foss User <[email protected]> wrote:
>
> > On Sun, Apr 5, 2009 at 3:18 PM, Foss User <[email protected]> wrote:
> > > I have a Hadoop cluster of 5 nodes: (1) Namenode (2) Job tracker (3)
> > > First slave (4) Second slave (5) Client from where I submit jobs
> > >
> > > I brought system no. 4 down by running:
> > >
> > >     bin/hadoop-daemon.sh stop datanode
> > >     bin/hadoop-daemon.sh stop tasktracker
> > >
> > > After this I tried running my word count job again and I got this error:
> > >
> > > foss...@hadoop-client:~/mcr-wordcount$ hadoop jar
> > > dist/mcr-wordcount-0.1.jar com.fossist.examples.WordCountJob
> > > /fossist/inputs /fossist/output7
> > > 09/04/05 15:13:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing
> > > the arguments. Applications should implement Tool for the same.
> > > 09/04/05 15:13:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > > java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
> > > 09/04/05 15:13:03 INFO hdfs.DFSClient: Abandoning block blk_-6478273736277251749_1034
> > > 09/04/05 15:13:09 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > > java.net.ConnectException: Connection refused
> > > 09/04/05 15:13:09 INFO hdfs.DFSClient: Abandoning block blk_-7054779688981181941_1034
> > > 09/04/05 15:13:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > > java.net.ConnectException: Connection refused
> > > 09/04/05 15:13:15 INFO hdfs.DFSClient: Abandoning block blk_-6231549606860519001_1034
> > > 09/04/05 15:13:21 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > > java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
> > > 09/04/05 15:13:21 INFO hdfs.DFSClient: Abandoning block blk_-7060117896593271410_1034
> > > 09/04/05 15:13:27 WARN hdfs.DFSClient: DataStreamer Exception:
> > > java.io.IOException: Unable to create new block.
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > >
> > > 09/04/05 15:13:27 WARN hdfs.DFSClient: Error Recovery for block
> > > blk_-7060117896593271410_1034 bad datanode[1] nodes == null
> > > 09/04/05 15:13:27 WARN hdfs.DFSClient: Could not get block locations.
> > > Source file "/tmp/hadoop-hadoop/mapred/system/job_200904042051_0011/job.jar"
> > > - Aborting...
> > > java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > >
> > > Note that 192.168.1.5 is the Hadoop slave where I stopped the datanode and
> > > tasktracker. This is a serious concern for me because if I am unable to run
> > > jobs after a certain node goes down, then the purpose of the cluster is
> > > defeated.
> > >
> > > Could someone please help me understand whether this is a human error on my
> > > part or a problem in Hadoop? Is there any way to avoid it?
> > >
> > > Please note that I can still read all my data in the 'inputs' directory
> > > using commands like:
> > >
> > > foss...@hadoop-client:~/mcr-wordcount$ hadoop dfs -cat /fossist/inputs/input1.txt
> > >
> > > Please help.
> >
> > Here is an update. After waiting for some time, I don't know exactly how much,
> > the namenode web page on port 50070 showed the down node as a 'dead node' and
> > I was able to run jobs again like before. Does this mean that Hadoop takes a
> > while to accept that a node is dead?
> >
> > Is this good by design? In the first five minutes or so, when Hadoop is in
> > denial that a node is dead, all new jobs start failing. Is there a way I, as a
> > user, can tell Hadoop to start using the other available nodes during this
> > denial period?
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
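A minimal hadoop-site.xml sketch of the explicit decommission route Jason describes above; the exclude-file path is only an example, and the file must exist (it can start out empty):

    <property>
      <name>dfs.hosts.exclude</name>
      <!-- example path; point it at a file that exists, even if empty -->
      <value>/home/hadoop/conf/excludes</value>
    </property>

After adding a slave's name (as it appears in the slaves file) to that file and running hadoop dfsadmin -refreshNodes, the namenode starts decommissioning that datanode; as Bill notes, a host decommissioned this way has to be explicitly put back into the cluster later.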
