i just saw that the job finally completed. however it took one hour
and 45 minutes - for a small job that runs in about 1-2 minutes on a
single node (outside hadoop framework). The reduce part took
extremely long - the output of the tasktracker shows about 5 of the
sample sections for each of the copies (11 copies). so clearly
something is limiting the reduce...
Any clues?
Thanks
Florian
Jobtracker log:
...
2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #1 for task task_200801281756_0002_m_000007_0
2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #2 for task task_200801281756_0001_m_000007_0
2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #2 for task task_200801281756_0002_m_000008_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #3 for task task_200801281756_0001_m_000007_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress:
Too many fetch-failures for output of task:
task_200801281756_0001_m_000007_0 ... killing it
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from task_200801281756_0001_m_000007_0: Too many fetch-failures
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing normal task tip_200801281756_0001_m_000007
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker:
Adding task 'task_200801281756_0001_m_000007_1' to tip
tip_200801281756_0001_m_000007, for tracker 'tracker_localhost:/
127.0.0.1:43625'
2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'task_200801281756_0001_m_000007_0' from
'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress:
Task 'task_200801281756_0001_m_000007_1' has completed
tip_200801281756_0001_m_000007 successfully.
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #3 for task task_200801281756_0001_m_000006_0
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress:
Too many fetch-failures for output of task:
task_200801281756_0001_m_000006_0 ... killing it
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from task_200801281756_0001_m_000006_0: Too many fetch-failures
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing normal task tip_200801281756_0001_m_000006
2008-01-28 19:51:44,626 INFO org.apache.hadoop.mapred.JobTracker:
Adding task 'task_200801281756_0001_m_000006_1' to tip
tip_200801281756_0001_m_000006, for tracker 'tracker_localhost:/
127.0.0.1:43625'
2008-01-28 19:51:44,671 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'task_200801281756_0001_m_000006_0' from
'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:46,503 INFO org.apache.hadoop.mapred.JobInProgress:
Task 'task_200801281756_0001_m_000006_1' has completed
tip_200801281756_0001_m_000006 successfully.
2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.TaskRunner:
Saved output of task 'task_200801281756_0001_r_000000_0' to hdfs://
humanufactory.com:9000/output
2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.JobInProgress:
Task 'task_200801281756_0001_r_000000_0' has completed
tip_200801281756_0001_r_000000 successfully.
2008-01-28 19:51:51,779 INFO org.apache.hadoop.mapred.JobInProgress:
Job job_200801281756_0001 has completed successfully.
2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'task_200801281756_0001_m_000006_0' from
'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'task_200801281756_0001_m_000007_0' from
'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'task_200801281756_0001_m_000002_0' from
'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'task_200801281756_0001_m_000004_0' from
'tracker_localhost:/127.0.0.1:43625'
...
Tasktracker log:
2008-01-28 19:07:02,844 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker:
getMapOutput(task_200801281756_0001_m_000006_0,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
task_200801281756_0001_m_000006_0/file.out.index in any of the
configured local directories
at org.apache.hadoop.fs.LocalDirAllocator
$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead
(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet
(TaskTracker.java:1967)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:
427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch
(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle
(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle
(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at org.mortbay.http.SocketListener.handleConnection
(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker:
Unknown child with bad map output: task_200801281756_0001_m_000006_0.
Ignored.
2008-01-28 19:07:08,864 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:14,884 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:17,894 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:23,914 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:29,934 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:32,944 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:38,964 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:44,983 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
2008-01-28 19:07:47,994 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11
at 0.00 MB/s) >
On Jan 28, 2008, at 8:43 PM, Amar Kamat wrote:
Hi,
Can you plz attach the Jobtracker and any of the stuck reducer's
log? Is there any way I can reproduce this error?
Amar
Florian Leibert wrote:
hi,
i setup a cluster with 3 slaves and 1 master. when running the
wordcount example, hadoop execution halts after the mapping phase.
It has started the reduce phase and generally gets to 10-20% but
halts there. I left it running for 30 minutes, no change. Log
files don't show any errors...
I reformatted the Hadoop FS that i created, access to the files
from all nodes seems fine.
Has anyone seen a similar behavior?
Thanks,
Florian