Ronan,

A couple of simple things to check first:
1. Make sure the firewall isn't the one at fault here. Best to disable
the firewall if you do not need it, or carefully configure the rules to
allow in/out traffic over the chosen ports.
2. Ensure that the hostnames fs.default.name and mapred.job.tracker bind
to resolve to external IP addresses, not to localhost (loopback
interface bound) addresses.

On Tue, Jul 17, 2012 at 12:05 AM, Ronan Lehane <ronan.leh...@gmail.com> wrote:
> Hi All,
>
> I was wondering if anyone could help me figure out what's going wrong in my
> five node Hadoop cluster, please?
>
> It consists of:
> 1. NameNode
> hduser@namenode:/usr/local/hadoop$ jps
> 13049 DataNode
> 13387 Jps
> 12740 NameNode
> 13316 SecondaryNameNode
>
> 2. JobTracker
> hduser@jobtracker:/usr/local/hadoop$ jps
> 21817 TaskTracker
> 21448 DataNode
> 21542 JobTracker
> 21862 Jps
>
> 3. Slave1
> hduser@slave1:/usr/local/hadoop$ jps
> 21226 DataNode
> 21514 Jps
> 21463 TaskTracker
>
> 4. Slave2
> hduser@slave2:/usr/local/hadoop$ jps
> 20938 Jps
> 20650 DataNode
> 20887 TaskTracker
>
> 5. Slave3
> hduser@slave3:/usr/local/hadoop$ jps
> 22145 Jps
> 21854 DataNode
> 22091 TaskTracker
>
> All DataNodes have been kicked off by running start-dfs.sh on the NameNode
> All TaskTrackers have been kicked off by running start-mapred.sh on the
> JobTracker
>
> When I try to execute a simple wordcount job from the NameNode I receive
> the following error:
> 12/07/16 19:25:22 ERROR security.UserGroupInformation:
> PriviledgedActionException as:hduser cause:java.net.ConnectException: Call
> to jobtracker/10.21.68.218:54311 failed on connection exception:
> java.net.ConnectException: Connection refused
>
> If I check the jobtracker:
> 1. I can ping in both directions by both IP and Hostname
> 2. I can see that the jobtracker is listening on port 54311
> tcp 0 0 127.0.0.1:54311 0.0.0.0:*
> LISTEN 1001 425093 21542/java
> 3. Telnet to this port from the NameNode fails with "Connection Refused"
> telnet: Unable to connect to remote host: Connection refused
>
> This issue can be worked around by moving the JobTracker functionality to
> the NameNode, but when this is done the job is executed on the NameNode
> rather than distributed across the cluster.
> Checking the log files on the slaves nodes, I see Server Not Available
> messages referenced at the below wiki.
> http://wiki.apache.org/hadoop/ServerNotAvailable
> The Data Nodes not seeing the NameNode and the Task Trackers not seeing
> JobTracker.
> Checking the JobTracker web interface, it always states there is only 1
> node available.
>
> I've checked the 5 troubleshooting steps provided but it all looks to be ok
> in my environment.
>
> Would anyone have any idea's of what could be causing this?
> Any help would be appreciated.
>
> Cheers,
> Ronan

--
Harsh J
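
P.S. On point 2: the netstat output in your mail shows the JobTracker
listening on 127.0.0.1:54311, which is consistent with the hostname in
mapred.job.tracker resolving to a loopback address on that machine (some
distributions map the machine's own hostname to a 127.x address in
/etc/hosts). Below is a rough Python sketch for checking this on each node.
The function names are made up for illustration, and it only inspects name
resolution as seen from the node it runs on; it does not read the Hadoop
configuration or talk to any daemon.

```python
import ipaddress
import socket
import sys


def is_loopback(address):
    """Return True if the IP address string is a loopback address."""
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def report(hostname):
    """Print what the hostname resolves to and flag loopback addresses."""
    try:
        addresses = resolved_addresses(hostname)
    except OSError as exc:
        print(f"{hostname}: could not resolve ({exc})")
        return
    for address in addresses:
        note = "LOOPBACK, not reachable from other nodes" if is_loopback(address) else "ok"
        print(f"{hostname} -> {address} [{note}]")


if __name__ == "__main__":
    # Pass hostnames on the command line; defaults to this machine's own name.
    for name in sys.argv[1:] or [socket.gethostname()]:
        report(name)
```

Run it on the jobtracker machine with the hostname used in
mapred.job.tracker as the argument (and likewise on the namenode for
fs.default.name). If it reports a loopback address, fix the /etc/hosts
entry for that hostname and restart the daemons.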