Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Bradford Stephens
Hey everyone,

I'm having a similar problem:

Map output lost, rescheduling:
getMapOutput(task_200803281212_0001_m_00_2,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
task_200803281212_0001_m_00_2/file.out.index in any of the
configured local directories

Then it fails in about 10 minutes. I'm just trying to grep some etexts.

New HDFS installation on 2 nodes (one master, one slave). Ubuntu
Linux, Dell Core 2 Duo processors, Java 1.5.0.

I have a feeling it's a configuration issue. Anyone else run into it?


On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner [EMAIL PROTECTED] wrote:
 We are running under Linux with DFS on GigE LANs, kernel
  2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors.
  Our replication factor was set to 3



  Florian Leibert wrote:
   Maybe it helps to know that we're running Hadoop inside amazon's EC2...
  
   Thanks,
   Florian
  

  --
  Jason Venner
  Attributor - Publish with Confidence http://www.attributor.com/
  Attributor is hiring Hadoop Wranglers, contact if interested



Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Bradford Stephens
Also, I'm running Hadoop 0.16.1 :)

On Fri, Mar 28, 2008 at 1:23 PM, Bradford Stephens
[EMAIL PROTECTED] wrote:
 Hey everyone,

  I'm having a similar problem:

  Map output lost, rescheduling:
  getMapOutput(task_200803281212_0001_m_00_2,0) failed :
  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
  task_200803281212_0001_m_00_2/file.out.index in any of the
  configured local directories

  Then it fails in about 10 minutes. I'm just trying to grep some etexts.

  New HDFS installation on 2 nodes (one master, one slave). Ubuntu
  Linux, Dell Core 2 Duo processors, Java 1.5.0.

  I have a feeling it's a configuration issue. Anyone else run into it?




  On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner [EMAIL PROTECTED] wrote:
  We are running under Linux with DFS on GigE LANs, kernel
  2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors.
  Our replication factor was set to 3
  
  
  
Florian Leibert wrote:
 Maybe it helps to know that we're running Hadoop inside amazon's EC2...

 Thanks,
 Florian

  
--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested
  



RE: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Devaraj Das
Hi Bradford,
Could you please check what your mapred.local.dir is set to?
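(For reference, mapred.local.dir is picked up from conf/hadoop-site.xml,
overriding the default in hadoop-default.xml. A typical entry looks
something like the sketch below; the paths here are only placeholders:

  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local,/data/2/mapred/local</value>
  </property>

The value can be a comma-separated list of directories, and every node
running a TaskTracker needs to be able to write to them.)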
Devaraj. 

 -----Original Message-----
 From: Bradford Stephens [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, March 29, 2008 1:54 AM
 To: core-user@hadoop.apache.org
 Cc: [EMAIL PROTECTED]
 Subject: Re: hadoop 0.15.3 r612257 freezes on reduce task
 
 Hey everyone,
 
 I'm having a similar problem:
 
 Map output lost, rescheduling:
 getMapOutput(task_200803281212_0001_m_00_2,0) failed :
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could 
 not find task_200803281212_0001_m_00_2/file.out.index in 
 any of the configured local directories
 
 Then it fails in about 10 minutes. I'm just trying to grep 
 some etexts.
 
 New HDFS installation on 2 nodes (one master, one slave). 
 Ubuntu Linux, Dell Core 2 Duo processors, Java 1.5.0.
 
 I have a feeling it's a configuration issue. Anyone else run into it?
 
 
 On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner 
 [EMAIL PROTECTED] wrote:
  We are running under Linux with DFS on GigE LANs, kernel
  2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors.
  Our replication factor was set to 3
 
 
 
   Florian Leibert wrote:
Maybe it helps to know that we're running Hadoop inside 
 amazon's EC2...
   
Thanks,
Florian
   
 
   --
   Jason Venner
   Attributor - Publish with Confidence http://www.attributor.com/  
  Attributor is hiring Hadoop Wranglers, contact if interested
 
 



Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Bradford Stephens
Thanks for the hint, Devaraj! I was using paths for mapred.local.dir that
were based on ~/, so I gave it an absolute path instead. Also, the
directory for hadoop.tmp.dir did not exist on one machine :)
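
(In case anyone else hits this, the relevant bits of conf/hadoop-site.xml
ended up looking roughly like the sketch below -- the actual paths are
just placeholders here:

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- hypothetical absolute path; the directory must exist (or be creatable) on every node -->
    <value>/home/hadoop/hadoop-tmp</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <!-- hypothetical absolute path; ~/ is not expanded here, so home-relative paths break -->
    <value>/home/hadoop/mapred/local</value>
  </property>

If I read hadoop-default.xml right, mapred.local.dir also defaults to
${hadoop.tmp.dir}/mapred/local, which is probably why the missing
hadoop.tmp.dir directory hurt as well.)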


On Fri, Mar 28, 2008 at 2:00 PM, Devaraj Das [EMAIL PROTECTED] wrote:
 Hi Bradford,
  Could you please check what your mapred.local.dir is set to?
  Devaraj.



   -----Original Message-----
   From: Bradford Stephens [mailto:[EMAIL PROTECTED]
   Sent: Saturday, March 29, 2008 1:54 AM
   To: core-user@hadoop.apache.org
   Cc: [EMAIL PROTECTED]
   Subject: Re: hadoop 0.15.3 r612257 freezes on reduce task
  
   Hey everyone,
  
   I'm having a similar problem:
  
   Map output lost, rescheduling:
   getMapOutput(task_200803281212_0001_m_00_2,0) failed :
   org.apache.hadoop.util.DiskChecker$DiskErrorException: Could
   not find task_200803281212_0001_m_00_2/file.out.index in
   any of the configured local directories
  
   Then it fails in about 10 minutes. I'm just trying to grep
   some etexts.
  
   New HDFS installation on 2 nodes (one master, one slave).
   Ubuntu Linux, Dell Core 2 Duo processors, Java 1.5.0.
  
   I have a feeling it's a configuration issue. Anyone else run into it?
  
  
   On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner
   [EMAIL PROTECTED] wrote:
 We are running under Linux with DFS on GigE LANs, kernel
 2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors.
 Our replication factor was set to 3
   
   
   
 Florian Leibert wrote:
  Maybe it helps to know that we're running Hadoop inside
   amazon's EC2...
 
  Thanks,
  Florian
 
   
 --
 Jason Venner
 Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested
   
  




Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-01-29 Thread Jason Venner
We are running under Linux with DFS on GigE LANs, kernel
2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors.

Our replication factor was set to 3.
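
(For reference, that's the dfs.replication setting -- 3 is also the
default -- which would look something like this in hadoop-site.xml:

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
)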

Florian Leibert wrote:

Maybe it helps to know that we're running Hadoop inside amazon's EC2...

Thanks,
Florian



--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested


Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-01-29 Thread Jason Venner
That was the error that we were seeing in our hung reduce tasks. It went
away for us, and we never figured out why. A number of things happened in
our environment around the time it went away: we shifted to 0.15.2, our
cluster moved to a separate switched VLAN from our main network, and we
started using different machines for our cluster.


Florian Leibert wrote:

Hi,
I got some more logs (from the other nodes) - maybe this leads to some 
conclusions:


###Node 1
2008-01-28 18:07:08,017 ERROR org.apache.hadoop.dfs.DataNode: 
DataXceiver: java.io.IOException: Block blk_6263175697396671978 is 
valid, and cannot be written to.

at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Unknown Source)

###Node 2
2008-01-28 18:07:08,109 WARN org.apache.hadoop.dfs.DataNode: Failed to 
transfer blk_6263175697396671978 to 10.253.14.144:50010 got 
java.io.IOException: operation failed at /10.253.14.144

at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
at java.lang.Thread.run(Thread.java:619)

###Node 3
2008-01-28 17:57:45,120 WARN org.apache.hadoop.dfs.DataNode: Failed to 
transfer blk_-3751486067814847527 to 10.253.19.0:50010 got 
java.io.IOException: operation failed at /10.253.19.0

at org.apache.hadoop.dfs.DataNode.receiveResponse(DataNode.java:704)
at org.apache.hadoop.dfs.DataNode.access$200(DataNode.java:77)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1463)
at java.lang.Thread.run(Thread.java:619)

2008-01-28 17:57:45,372 INFO org.apache.hadoop.dfs.DataNode: Served 
block blk_8941456674455415759 to /10.253.34.241
2008-01-28 17:57:45,383 INFO org.apache.hadoop.dfs.DataNode: Served 
block blk_-3751486067814847527 to /10.253.34.241
2008-01-28 18:07:02,026 INFO org.apache.hadoop.dfs.DataNode: Received 
block blk_-2349720010162881555 from /10.253.14.144

...

Thanks,
Florian

On Jan 29, 2008, at 4:55 AM, Amar Kamat wrote:

It seems that the reducers were not able to copy the output from the
mappers (reduce % > 33 means that the copy of the map output, i.e. the
shuffle phase, is over), waited a long time for the mappers to recover,
and then finally, after waiting for long enough (this is expected), the
JobTracker killed the maps and re-executed them on the local machine,
and hence the job got completed. It takes about 15 min on average for a
map to get killed by one mapper. It seems to be a disk problem on the
machine where task_200801281756_0002_m_06_0, task_200801281756_0002_m_07_0
and task_200801281756_0002_m_08_0 got scheduled. Can you check the
disk space/health of these machines?

Amar
Florian Leibert wrote:
I just saw that the job finally completed. However, it took one hour
and 45 minutes - for a small job that runs in about 1-2 minutes on a
single node (outside the Hadoop framework). The reduce part took
extremely long - the output of the TaskTracker shows about 5 of the
sample sections for each of the copies (11 copies). So clearly
something is limiting the reduce...


Any clues?

Thanks

Florian


Jobtracker log:
...
2008-01-28 19:08:01,981 INFO org.apache.hadoop.mapred.JobInProgress: 
Failed fetch notification #1 for task task_200801281756_0002_m_07_0
2008-01-28 19:16:00,848 INFO org.apache.hadoop.mapred.JobInProgress: 
Failed fetch notification #2 for task task_200801281756_0001_m_07_0
2008-01-28 19:32:33,485 INFO org.apache.hadoop.mapred.JobInProgress: 
Failed fetch notification #2 for task task_200801281756_0002_m_08_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: 
Failed fetch notification #3 for task task_200801281756_0001_m_07_0
2008-01-28 19:42:12,822 INFO org.apache.hadoop.mapred.JobInProgress: 
Too many fetch-failures for output of task: 
task_200801281756_0001_m_07_0 ... killing it
2008-01-28 19:42:12,823 INFO 
org.apache.hadoop.mapred.TaskInProgress: Error from 
task_200801281756_0001_m_07_0: Too many fetch-failures
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobInProgress: 
Choosing normal task tip_200801281756_0001_m_07
2008-01-28 19:42:12,823 INFO org.apache.hadoop.mapred.JobTracker: 
Adding task 'task_200801281756_0001_m_07_1' to tip 
tip_200801281756_0001_m_07, for tracker 
'tracker_localhost:/127.0.0.1:43625'
2008-01-28 19:42:14,103 INFO org.apache.hadoop.mapred.JobTracker: 
Removed completed task 'task_200801281756_0001_m_07_0' from 
'tracker_localhost.localdomain:/127.0.0.1:37469'
2008-01-28 19:42:14,538 INFO org.apache.hadoop.mapred.JobInProgress: 
Task 'task_200801281756_0001_m_07_1' has completed 
tip_200801281756_0001_m_07 successfully.
2008-01-28 19:51:44,625 INFO