Hi,
I got some more logs (from the other nodes) - maybe this leads to
some conclusions:
###Node 1
2008-01-28 18:07:08,017 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_6263175697396671978 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Unknown Source)
###Node 2
2008-01-28 18:07:08,109 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_6263175697396671978 to 10.253.14.144:50010 got java.io.IOException: operation failed at /10.253.14.144
at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
at java.lang.Thread.run(Thread.java:619)
###Node 3
2008-01-28 17:57:45,120 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-3751486067814847527 to 10.253.19.0:50010 got java.io.IOException: operation failed at /10.253.19.0
at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
at java.lang.Thread.run(Thread.java:619)
2008-01-28 17:57:45,372 INFO org.apache.hadoop.dfs.DataNode: Served block blk_8941456674455415759 to /10.253.34.241
2008-01-28 17:57:45,383 INFO org.apache.hadoop.dfs.DataNode: Served block blk_-3751486067814847527 to /10.253.34.241
2008-01-28 18:07:02,026 INFO org.apache.hadoop.dfs.DataNode: Received block blk_-2349720010162881555 from /10.253.14.144
...
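Since the datanodes are reporting transfer failures on specific blocks, it may be worth asking HDFS itself whether it considers those blocks healthy. One way is fsck, roughly as below (exact option names depend on your Hadoop version; this is 0.15-era syntax):

```shell
# Run from the Hadoop install directory on the master.
# -files/-blocks/-locations print per-file block placement, so you can
# see which datanodes hold (or are missing) the affected blocks.
bin/hadoop fsck / -files -blocks -locations
```

Grepping that output for the block IDs from the logs above (e.g. blk_6263175697396671978) should show whether they are under-replicated or corrupt.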
Thanks,
Florian
On Jan 29, 2008, at 4:55 AM, Amar Kamat wrote:
It seems that the reducers were not able to copy the output from the mappers (reduce % > 33 would mean that the copy of map output, i.e. the shuffle phase, is over). They waited a long time for the mappers to recover, and finally, after waiting for long (this is expected), the JobTracker killed the maps and re-executed them on the local machine, hence the job got completed. On average it takes about 15 minutes for a map to get killed this way. It seems to be a disk problem on the machine(s) where task_200801281756_0002_m_000006_0, task_200801281756_0002_m_000007_0 and task_200801281756_0002_m_000008_0 got scheduled. Can you check the disk space/health of those machines?
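A small script along these lines may help with that check (the directory list is an example; substitute your configured dfs.data.dir and mapred.local.dir paths):

```shell
#!/bin/sh
# Hypothetical quick disk check for a slave node.
# DIRS is an example; replace with your dfs.data.dir / mapred.local.dir.
DIRS="/tmp/hadoop /data"
for d in $DIRS; do
  if [ -d "$d" ]; then
    # report free space on the filesystem holding $d
    df -k "$d" | tail -1
    # verify the directory is writable by creating a probe file
    probe="$d/.disk_probe.$$"
    if touch "$probe" 2>/dev/null; then
      echo "$d: writable"
      rm -f "$probe"
    else
      echo "$d: NOT writable"
    fi
  else
    echo "$d: missing"
  fi
done
```

Running this on each slave should quickly reveal a full, missing, or read-only disk.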
Amar
Florian Leibert wrote:
I just saw that the job finally completed. However, it took one hour and 45 minutes - for a small job that runs in about 1-2 minutes on a single node (outside the Hadoop framework). The reduce part took extremely long: the output of the TaskTracker shows about 5 of the sample sections for each of the 11 copies. So clearly something is limiting the reduce...
Any clues?
Thanks
Florian
Jobtracker log:
...
2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200801281756_0002_m_000007_0
2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0001_m_000007_0
2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0002_m_000008_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000007_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_000007_0 ... killing it
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_000007_0: Too many fetch-failures
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_000007
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_000007_1' to tip tip_200801281756_0001_m_000007, for tracker 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000007_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_000007_1' has completed tip_200801281756_0001_m_000007 successfully.
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_000006_0
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_000006_0 ... killing it
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_000006_0: Too many fetch-failures
2008-01-28 19:51:44,625 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_000006
2008-01-28 19:51:44,626 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_000006_1' to tip tip_200801281756_0001_m_000006, for tracker 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:51:44,671 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000006_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:46,503 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_000006_1' has completed tip_200801281756_0001_m_000006 successfully.
2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200801281756_0001_r_000000_0' to hdfs://humanufactory.com:9000/output
2008-01-28 19:51:51,768 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_r_000000_0' has completed tip_200801281756_0001_r_000000 successfully.
2008-01-28 19:51:51,779 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200801281756_0001 has completed successfully.
2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000006_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:54,680 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000007_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000002_0' from 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:51:56,674 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_000004_0' from 'tracker_localhost:/127.0.0.1:43625'
...
Tasktracker log:
2008-01-28 19:07:02,844 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200801281756_0001_m_000006_0,0) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_200801281756_0001_m_000006_0/file.out.index in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1967)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
2008-01-28 19:07:08,330 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: task_200801281756_0001_m_000006_0. Ignored.
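That DiskErrorException means the reducer's HTTP fetch failed because the TaskTracker could not locate the map output index file under any of its mapred.local.dir directories. A manual check on the affected node might look like this (the directory is an example; use the actual mapred.local.dir value from your hadoop-site.xml):

```shell
# Example local dir; substitute the actual mapred.local.dir entries.
LOCAL_DIRS="/tmp/hadoop/mapred/local"
for d in $LOCAL_DIRS; do
  # list any surviving output files for the failed map attempt
  find "$d" -name 'file.out*' -path '*task_200801281756_0001_m_000006_0*' 2>/dev/null
done
```

If nothing turns up, the map output was lost (or never written), which is consistent with a failing or full disk on that node.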
2008-01-28 19:07:08,864 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:14,884 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:17,894 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:23,914 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:29,934 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:32,944 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:38,964 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:44,983 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
2008-01-28 19:07:47,994 INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
On Jan 28, 2008, at 8:43 PM, Amar Kamat wrote:
Hi,
Can you please attach the JobTracker log and the log of any of the stuck reducers? Is there any way I can reproduce this error?
Amar
Florian Leibert wrote:
Hi,
I set up a cluster with 3 slaves and 1 master. When running the wordcount example, Hadoop execution halts after the map phase: it starts the reduce phase and generally gets to 10-20%, but halts there. I left it running for 30 minutes with no change, and the log files don't show any errors...
I reformatted the Hadoop FS that I created; access to the files from all nodes seems fine.
Has anyone seen a similar behavior?
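For anyone trying to reproduce this: the job in question is just the stock wordcount example, launched roughly as below (the jar name and paths vary by Hadoop version and setup, so treat these as placeholders):

```shell
# From the Hadoop install directory on the master (paths are examples).
bin/hadoop dfs -put local-input input            # stage input files in HDFS
bin/hadoop jar hadoop-*-examples.jar wordcount input output
```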
Thanks,
Florian