Seems I was looking in the wrong log file :) ... I was looking at the tasktracker log when I should have been looking at what sits underneath it!
It was a problem with HDFS breaking because the machines couldn't find each
other... they are configured with IPs in hadoop-site.xml, but when the
cluster is running they (somehow) try to resolve each other's hostnames...
Does anyone know why?

I fixed it by adding each node's hostname to the other nodes' /etc/hosts...

Pedro

Pedro Guedes wrote:
> Well, moving to 0.11.2 won't fix it... tried that!
>
> The first interesting thing in the log is:
> 2007-04-02 15:45:41,960 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: File /home/ciclope/hadoop-install/hadoop-data/mapred/local/task_0001_r_000001_0/map_10.out-0 not created
>     at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:282)
>     at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:243)
>
> And the tasktracker of the slave node keeps repeating itself with:
>
> 2007-04-02 15:47:03,089 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000003_0 Need 12 map output(s)
> 2007-04-02 15:47:03,089 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000003_0 Got 12 known map output location(s); scheduling...
> 2007-04-02 15:47:03,089 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000003_0 Scheduled 0 of 12 known outputs (12 slow hosts and 0 dup hosts)
> 2007-04-02 15:47:03,273 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000001_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:03,969 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000003_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:04,277 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000001_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:04,973 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000003_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:05,281 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000001_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:05,977 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000003_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:06,285 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000001_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
> 2007-04-02 15:47:06,981 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000003_0 0.14285715% reduce > copy (9 of 21 at 0.00 MB/s) >
>
> Pedro Guedes wrote:
>
>> Hi hadooping people...
>>
>> I'm having trouble running the wordcount example with hadoop... I ran it
>> OK with only one host, but when I add another machine to the cluster...
>> it falls apart! :(
>>
>> I read in the mailing-list archive about someone having a similar
>> problem, but the proposed solution was to downgrade to 0.11.2 (from
>> 0.12.0; I'm using 0.12.2)... is that right? A reference here:
>> http://www.mail-archive.com/[email protected]/msg00863.html
>>
>> The only difference in my case is that mine hangs at around 60% of the
>> reduce phase... but the tasktracker for the slave node shows the same
>> 'IOException: file .....mapx_out not created' and that's the only error
>> I see...
>>
>> Any suggestions?
>>
>> Thanks in advance...
>>
>> Pedro
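
For anyone who hits the same symptom: below is a minimal sketch, in Python, of a check for the hostname-resolution problem described at the top of this message. It is not part of Hadoop and uses no Hadoop API. The function name `unresolved_hosts` is made up for illustration, and the node names in the usage note are hypothetical placeholders. It only reports which names the local resolver (which consults /etc/hosts on a typical Linux setup) cannot turn into an address.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that `resolve` cannot turn into an address.

    `resolve` defaults to the system resolver; it is a parameter only so
    the check can be exercised without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror (raised on lookup failure) is an OSError subclass.
            failed.append(name)
    return failed
```

Run on each node with the hostnames of all the other nodes, for example `unresolved_hosts(["master-node", "slave-node"])` (hypothetical names). An empty list means every name resolved from that machine; any name that comes back is a candidate for an /etc/hosts entry on that node.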
