I just saw that the job finally completed. However, it took one hour and 45 minutes for a small job that runs in about 1-2 minutes on a single node (outside the Hadoop framework). The reduce part took extremely long: the TaskTracker output shows about 5 of the sample sections for each of the copies (11 copies), so clearly something is limiting the reduce...

Any clues?

Thanks

Florian


Jobtracker log:
...
2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200801281756_0002_m_000007_0
2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0001_m_000007_0
2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0002_m_000008_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000007_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_000007_0 ... killing it
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_000007_0: Too many fetch-failures
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_000007
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_000007_1' to tip tip_200801281756_0001_m_000007, for tracker 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000007_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_000007_1' has completed tip_200801281756_0001_m_000007 successfully.
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000006_0
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_000006_0 ... killing it
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_000006_0: Too many fetch-failures
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_000006
2008-01-28 19:51:44,626 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_000006_1' to tip tip_200801281756_0001_m_000006, for tracker 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:51:44,671 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000006_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:46,503 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_000006_1' has completed tip_200801281756_0001_m_000006 successfully.
2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200801281756_0001_r_000000_0' to hdfs://humanufactory.com:9000/output
2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_r_000000_0' has completed tip_200801281756_0001_r_000000 successfully.
2008-01-28 19:51:51,779 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200801281756_0001 has completed successfully.
2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000006_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000007_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000002_0' from 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000004_0' from 'tracker_localhost:/127.0.0.1:43625'
...
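(Aside: to see at a glance which map outputs the reducer kept failing to fetch, one can tally the "Failed fetch notification" entries per map task. Here is a minimal sketch, not part of Hadoop; it uses a few of the JobTracker lines pasted above inline, but in practice you would read the actual log file.)

```python
import re
from collections import Counter

# A few JobTracker lines from the log above; replace with the real log file contents.
log_lines = [
    "2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200801281756_0002_m_000007_0",
    "2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0001_m_000007_0",
    "2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0002_m_000008_0",
    "2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000007_0",
]

# Match the task id in each "Failed fetch notification" entry.
pattern = re.compile(r"Failed fetch notification #\d+ for task (task_\S+)")

# Count how many fetch-failure notifications each map task accumulated.
failures = Counter(
    m.group(1) for line in log_lines if (m := pattern.search(line))
)

for task, count in failures.most_common():
    print(task, count)
```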


Tasktracker log:
2008-01-28 19:07:02,844 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200801281756_0001_m_000006_0,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_200801281756_0001_m_000006_0/file.out.index in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1967)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: task_200801281756_0001_m_000006_0. Ignored.
2008-01-28 19:07:08,864 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:14,884 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:17,894 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:23,914 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:29,934 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:32,944 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:38,964 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:44,983 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:47,994 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >



On Jan 28, 2008, at 8:43 PM, Amar Kamat wrote:

Hi,
Can you please attach the JobTracker log and the log of any of the stuck reducers? Is there any way I can reproduce this error?
Amar
Florian Leibert wrote:
hi,
I set up a cluster with 3 slaves and 1 master. When running the wordcount example, Hadoop execution halts after the map phase. It starts the reduce phase and generally gets to 10-20%, but halts there. I left it running for 30 minutes with no change. The log files don't show any errors. I reformatted the Hadoop FS that I created, and access to the files from all nodes seems fine.
Has anyone seen a similar behavior?

Thanks,
Florian



