Hi, I faced a similar problem some time back. I think it's the network/communication latency between the master and the slaves that is the issue in your case. Try increasing the timeout interval in hadoop-site.xml.
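For example, something like the following in conf/hadoop-site.xml (just an illustrative sketch -- the 20-minute value is arbitrary; mapred.task.timeout controls how long a task may go without reporting progress before the framework declares it failed):

```xml
<property>
  <!-- milliseconds a task may run without reporting progress
       before it is killed; the default is 600000 (10 minutes) -->
  <name>mapred.task.timeout</name>
  <value>1200000</value>
</property>
```

Restart the daemons after changing the value so the JobTracker and TaskTrackers pick it up.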
V.V.Chaitanya Krishna
IIIT, Hyderabad
India

On Thu, Oct 16, 2008 at 4:53 AM, Lucas Di Pentima <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'm new to this list and to Hadoop too. I'm testing some basic
> configurations before I start running my own experiments. I've installed a
> Hadoop cluster of 2 machines as explained here:
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>
> I'm not using Ubuntu, but Debian Lenny, with Java 1.6.x and Hadoop 0.18.1
> installed on the systems.
>
> All daemons are running correctly, and HDFS is working properly, with
> the default replication level of 2 so that all files are replicated on
> both PCs. Both hosts have their clocks set correctly.
>
> The problem begins when I try to run the classic wordcount test. I load
> some Gutenberg files into HDFS, and then:
>
> $ bin/hadoop jar hadoop-0.18.1-examples.jar wordcount gutemberg gutemberg-output
>
> The map phase starts and reaches 100%, then reduce starts and freezes at
> approx 14%. I waited several minutes but the job didn't finish.
>
> Running "hadoop job -list" gives me this output:
>
> $ bin/hadoop job -list
> 1 jobs currently running
> JobId                  State  StartTime      UserName
> job_200810151758_0003  1      1224106290709  hadoop
>
> ...and I can kill it successfully.
>
> In my last test I left the job running and 1 hour later it terminated
> with the following messages:
>
> 08/10/15 20:56:38 INFO mapred.JobClient: Task Id : attempt_200810151952_0001_m_000002_0, Status : FAILED
> Too many fetch-failures
> 08/10/15 20:59:47 WARN mapred.JobClient: Error reading task output
> Connection timed out
> 08/10/15 21:02:56 WARN mapred.JobClient: Error reading task output
> Connection timed out
> 08/10/15 21:02:57 INFO mapred.JobClient: Job complete: job_200810151952_0001
> 08/10/15 21:02:57 INFO mapred.JobClient: Counters: 16
> 08/10/15 21:02:57 INFO mapred.JobClient:   File Systems
> 08/10/15 21:02:57 INFO mapred.JobClient:     HDFS bytes read=6945126
> 08/10/15 21:02:57 INFO mapred.JobClient:     HDFS bytes written=1410309
> 08/10/15 21:02:57 INFO mapred.JobClient:     Local bytes read=3472685
> 08/10/15 21:02:57 INFO mapred.JobClient:     Local bytes written=6422750
> 08/10/15 21:02:57 INFO mapred.JobClient:   Job Counters
> 08/10/15 21:02:57 INFO mapred.JobClient:     Launched reduce tasks=1
> 08/10/15 21:02:57 INFO mapred.JobClient:     Launched map tasks=12
> 08/10/15 21:02:57 INFO mapred.JobClient:     Data-local map tasks=12
> 08/10/15 21:02:57 INFO mapred.JobClient:   Map-Reduce Framework
> 08/10/15 21:02:57 INFO mapred.JobClient:     Reduce input groups=128360
> 08/10/15 21:02:57 INFO mapred.JobClient:     Combine output records=329346
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map input records=137114
> 08/10/15 21:02:57 INFO mapred.JobClient:     Reduce output records=128360
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map output bytes=11428977
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map input bytes=6945126
> 08/10/15 21:02:57 INFO mapred.JobClient:     Combine input records=1375481
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map output records=1174495
> 08/10/15 21:02:57 INFO mapred.JobClient:     Reduce input records=128360
>
> When I start the cluster with only the master server (namenode, jobtracker,
> datanode and tasktracker) the job works perfectly, so I suppose there's
> some problem with the communication with the slave node.
>
> Any help will be appreciated.
>
> --
> Lucas Di Pentima - http://lucas.di-pentima.com.ar
> GnuPG Public Key:
> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x6AA54FC9
> Key fingerprint = BD3B 08C4 661A 8C3B 1855 740C 8F98 3FCF 6AA5 4FC9
