Re: hadoop 0.15.3 r612257 freezes on reduce task
Hey everyone,

I'm having a similar problem:

Map output lost, rescheduling: getMapOutput(task_200803281212_0001_m_00_2,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_200803281212_0001_m_00_2/file.out.index in any of the configured local directories

Then it fails in about 10 minutes. I'm just trying to grep some etexts. New HDFS installation on 2 nodes (one master, one slave). Ubuntu Linux, Dell Core 2 Duo processors, Java 1.5.0. I have a feeling it's a configuration issue. Anyone else run into it?

On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner [EMAIL PROTECTED] wrote:

We are running under Linux with DFS on GigE LANs, kernel 2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors. Our replication factor was set to 3.

Florian Leibert wrote:

Maybe it helps to know that we're running Hadoop inside Amazon's EC2...

Thanks,
Florian

--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested
Re: hadoop 0.15.3 r612257 freezes on reduce task
Also, I'm running hadoop 0.16.1 :)

On Fri, Mar 28, 2008 at 1:23 PM, Bradford Stephens [EMAIL PROTECTED] wrote:

Hey everyone, I'm having a similar problem: [...]
RE: hadoop 0.15.3 r612257 freezes on reduce task
Hi Bradford,

Could you please check what your mapred.local.dir is set to?

Devaraj.

-Original Message-
From: Bradford Stephens [mailto:[EMAIL PROTECTED]
Sent: Saturday, March 29, 2008 1:54 AM
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Subject: Re: hadoop 0.15.3 r612257 freezes on reduce task

Hey everyone, I'm having a similar problem: [...]
Re: hadoop 0.15.3 r612257 freezes on reduce task
Thanks for the hint, Devaraj! I was using paths for mapred.local.dir that were based on ~/, so I gave it an absolute path instead. Also, the directory for hadoop.tmp.dir did not exist on one machine :)

On Fri, Mar 28, 2008 at 2:00 PM, Devaraj Das [EMAIL PROTECTED] wrote:

Hi Bradford, Could you please check what your mapred.local.dir is set to? Devaraj.
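The two problems mentioned above (a ~/-based path and a directory missing on one machine) are easy to check for on each node. Below is a minimal sketch in Python, not anything that ships with Hadoop: `check_local_dirs` is a hypothetical helper, and it assumes you paste in the comma-separated value of mapred.local.dir (or the single hadoop.tmp.dir path) from that node's configuration. The example paths are made up for illustration.

```python
import os
import tempfile


def check_local_dirs(value):
    """Return (path, problem) pairs for a comma-separated list of directories.

    Hypothetical helper: it only inspects the local filesystem and knows
    nothing about Hadoop itself.
    """
    problems = []
    for raw in value.split(","):
        path = raw.strip()
        if not path:
            continue  # ignore empty entries such as a trailing comma
        if path.startswith("~"):
            # A leading "~" is expanded by the shell, not by the JVM.
            problems.append((path, "starts with '~', which only a shell expands"))
        elif not os.path.isabs(path):
            problems.append((path, "not an absolute path"))
        elif not os.path.isdir(path):
            problems.append((path, "directory does not exist"))
        elif not os.access(path, os.W_OK):
            problems.append((path, "directory is not writable"))
    return problems


if __name__ == "__main__":
    # Illustrative values only: one directory that exists, one ~/-based
    # path, and one absolute path that is assumed not to exist.
    existing = tempfile.mkdtemp()
    value = ",".join([existing, "~/hadoop/mapred/local", "/no/such/dir/for/hadoop"])
    for path, problem in check_local_dirs(value):
        print(f"{path}: {problem}")
```

Run it on every node with that node's own values. An empty result only means that none of these particular checks failed, not that the rest of the setup is correct.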
Re: hadoop 0.15.3 r612257 freezes on reduce task
We are running under Linux with DFS on GigE LANs, kernel 2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors. Our replication factor was set to 3.

Florian Leibert wrote:

Maybe it helps to know that we're running Hadoop inside Amazon's EC2...

Thanks,
Florian

--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested
Re: hadoop 0.15.3 r612257 freezes on reduce task
That was the error that we were seeing in our hung reduce tasks. It went away for us, and we never figured out why. A number of things happened in our environment around the time it went away: we shifted to 0.15.2, our cluster moved to a separate switched VLAN from our main network, and we started using different machines for our cluster.

Florian Leibert wrote:

Hi,
I got some more logs (from the other nodes) - maybe this leads to some conclusions:

###Node 1
2008-01-28 18:07:08,017 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_6263175697396671978 is valid, and cannot be written to.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:1257)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
    at java.lang.Thread.run(Unknown Source)

###Node 2
008-01-28 18:07:08,109 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_6263175697396671978 to 10.253.14.144:50010 got java.io.IOException: operation failed at /10.253.14.144
    at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
    at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
    at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
    at java.lang.Thread.run(Thread.java:619)

###Node 3
2008-01-28 17:57:45,120 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-3751486067814847527 to 10.253.19.0:50010 got java.io.IOException: operation failed at /10.253.19.0
    at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
    at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
    at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
    at java.lang.Thread.run(Thread.java:619)
2008-01-28 17:57:45,372 INFO org.apache.hadoop.dfs.DataNode: Served block blk_8941456674455415759 to /10.253.34.241
2008-01-28 17:57:45,383 INFO org.apache.hadoop.dfs.DataNode: Served block blk_-3751486067814847527 to /10.253.34.241
2008-01-28 18:07:02,026 INFO org.apache.hadoop.dfs.DataNode: Received block blk_-2349720010162881555 from /10.253.14.144
...

Thanks,
Florian

On Jan 29, 2008, at 4:55 AM, Amar Kamat wrote:

It seems that the reducers were not able to copy the output from the mappers (reduce at 33% means that the copy of map output, i.e. the shuffle phase, is over). They waited a long time for the mappers to recover, and finally, after that long wait (this is expected), the JobTracker killed the map and re-executed it on the local machine, and hence the job completed. It takes 15 min on average for a map to get killed by one mapper. It seems to be a disk problem on the machine where task_200801281756_0002_m_06_0, task_200801281756_0002_m_07_0 and task_200801281756_0002_m_08_0 got scheduled. Can you check the disk space/health of these machines?

Amar

Florian Leibert wrote:

I just saw that the job finally completed. However, it took one hour and 45 minutes for a small job that runs in about 1-2 minutes on a single node (outside the Hadoop framework). The reduce part took extremely long - the output of the tasktracker shows about 5 of the sample sections for each of the copies (11 copies). So clearly something is limiting the reduce... Any clues?

Thanks
Florian

Jobtracker log:
...
2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200801281756_0002_m_07_0
2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0001_m_07_0
2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200801281756_0002_m_08_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200801281756_0001_m_07_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_m_07_0 ... killing it
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200801281756_0001_m_07_0: Too many fetch-failures
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_200801281756_0001_m_07
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200801281756_0001_m_07_1' to tip tip_200801281756_0001_m_07, for tracker 'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200801281756_0001_m_07_0' from 'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200801281756_0001_m_07_1' has completed tip_200801281756_0001_m_07 successfully.
2008-01-28 19:51:44,625 INFO