Hi,

I faced a similar problem some time back.
I think it's the network/communication latency between the master and the
slave that is the issue in your case. Try increasing the timeout interval in
hadoop-site.xml.
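
For example, something along these lines in conf/hadoop-site.xml on all
nodes (assuming the relevant knob here is mapred.task.timeout; the value
below is only an example, in milliseconds, and the MapReduce daemons need
a restart for it to take effect):

  <property>
    <name>mapred.task.timeout</name>
    <value>1200000</value>
    <description>Milliseconds before a task is declared failed if it
    reports no progress; this example is twice the usual default of
    600000.</description>
  </property>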

V.V.Chaitanya Krishna
IIIT,Hyderabad
India

On Thu, Oct 16, 2008 at 4:53 AM, Lucas Di Pentima
<[EMAIL PROTECTED]> wrote:

>  Hello all,
>
> I'm new to this list and to Hadoop too. I'm testing some basic
> configurations before I start my own experiments. I've installed a
> two-machine Hadoop cluster as explained here:
>
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>
> I'm not using Ubuntu, but Debian Lenny, with Java 1.6.x and Hadoop 0.18.1
> installed on the systems.
>
> All daemons are running correctly and HDFS is working properly, with a
> replication level of 2 so that every file is replicated on both machines.
> Both hosts have their clocks set correctly.
>
> The problem begins when I try to run the classic wordcount example. I load
> some Project Gutenberg files onto HDFS and then run:
>
> $ bin/hadoop jar hadoop-0.18.1-examples.jar wordcount gutemberg
> gutemberg-output
>
> The map phase starts and reaches 100%, then reduce starts and freezes at
> approx 14%. I waited several minutes but the job didn't finish.
>
> running "hadoop job -list" gives me this ouput:
>
> $ bin/hadoop job -list
> 1 jobs currently running
> JobId State StartTime UserName
> job_200810151758_0003 1 1224106290709 hadoop
>
> ...and I can kill it successfully.
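> (i.e. with "bin/hadoop job -kill job_200810151758_0003", or whatever the
> current job id is.)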
>
> In my last test I left the job running and 1 hour later it was terminated
> with the following messages:
>
>
> 08/10/15 20:56:38 INFO mapred.JobClient: Task Id :
> attempt_200810151952_0001_m_000002_0, Status : FAILED
> Too many fetch-failures
> 08/10/15 20:59:47 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/10/15 21:02:56 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/10/15 21:02:57 INFO mapred.JobClient: Job complete:
> job_200810151952_0001
> 08/10/15 21:02:57 INFO mapred.JobClient: Counters: 16
> 08/10/15 21:02:57 INFO mapred.JobClient:   File Systems
> 08/10/15 21:02:57 INFO mapred.JobClient:     HDFS bytes read=6945126
> 08/10/15 21:02:57 INFO mapred.JobClient:     HDFS bytes written=1410309
> 08/10/15 21:02:57 INFO mapred.JobClient:     Local bytes read=3472685
> 08/10/15 21:02:57 INFO mapred.JobClient:     Local bytes written=6422750
> 08/10/15 21:02:57 INFO mapred.JobClient:   Job Counters
> 08/10/15 21:02:57 INFO mapred.JobClient:     Launched reduce tasks=1
> 08/10/15 21:02:57 INFO mapred.JobClient:     Launched map tasks=12
> 08/10/15 21:02:57 INFO mapred.JobClient:     Data-local map tasks=12
> 08/10/15 21:02:57 INFO mapred.JobClient:   Map-Reduce Framework
> 08/10/15 21:02:57 INFO mapred.JobClient:     Reduce input groups=128360
> 08/10/15 21:02:57 INFO mapred.JobClient:     Combine output records=329346
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map input records=137114
> 08/10/15 21:02:57 INFO mapred.JobClient:     Reduce output records=128360
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map output bytes=11428977
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map input bytes=6945126
> 08/10/15 21:02:57 INFO mapred.JobClient:     Combine input records=1375481
> 08/10/15 21:02:57 INFO mapred.JobClient:     Map output records=1174495
> 08/10/15 21:02:57 INFO mapred.JobClient:     Reduce input records=128360
>
> When I start the cluster with only the master server (namenode, jobtracker,
> datanode and tasktracker), the job works perfectly, so I suppose there's
> some problem with the communication with the slave node.
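>
> (In case it helps narrow this down: as far as I understand, the reducers
> pull map output over HTTP from each tasktracker, which listens on port
> 50060 by default, so from the master I could try something like
>
> $ telnet slave-hostname 50060
>
> where "slave-hostname" is just a placeholder for whatever name the slave
> has in conf/slaves. Does that sound like a sensible check?)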
>
> Any help will be appreciated.
>
>   --
> Lucas Di Pentima - http://lucas.di-pentima.com.ar
> GnuPG Public Key:
> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x6AA54FC9
> Key fingerprint = BD3B 08C4 661A 8C3B 1855  740C 8F98 3FCF 6AA5 4FC9
>
