Re: JobTracker Failing to respond with OutOfMemory error

2008-12-05 Thread charles du
Any update on this?

We got a similar problem after we ran a Hadoop job with a lot of mappers.
Restarting the jobtracker solved the problem a few times. But right now, we
get the out-of-memory error right after restarting the jobtracker. Thanks.





On Wed, Nov 19, 2008 at 8:40 PM, Palleti, Pallavi 
[EMAIL PROTECTED] wrote:

 Hi all,



 We have been using hadoop-0.17.2 for some time now. Since yesterday, we have
 been seeing the JobTracker failing to respond with an OutOfMemoryError very
 frequently. Things go fine after restarting it, but the problem recurs
 after a while. Below is the exception that we are seeing in the
 jobtracker logs. Can someone please suggest what is going wrong here?





 2008-11-19 14:17:46,059 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 9 on 9001, call
 heartbeat([EMAIL PROTECTED], false,
 true, 16068)

 from 205.188.170.107:51492: error: java.io.IOException:
 java.lang.OutOfMemoryError: Java heap space

 java.io.IOException: java.lang.OutOfMemoryError: Java heap space

        at java.util.regex.Pattern.compile(Pattern.java:1452)
        at java.util.regex.Pattern.<init>(Pattern.java:1133)
        at java.util.regex.Pattern.compile(Pattern.java:847)
        at java.lang.String.replace(String.java:2208)
        at org.apache.hadoop.fs.Path.normalizePath(Path.java:146)
        at org.apache.hadoop.fs.Path.initialize(Path.java:137)
        at org.apache.hadoop.fs.Path.<init>(Path.java:126)
        at org.apache.hadoop.fs.Path.<init>(Path.java:50)
        at org.apache.hadoop.mapred.Task.getTaskOutputPath(Task.java:214)
        at org.apache.hadoop.mapred.Task.setConf(Task.java:517)
        at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:745)
        at org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:664)
        at org.apache.hadoop.mapred.JobTracker.getNewTaskForTaskTracker(JobTracker.java:1585)
        at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1309)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

 2008-11-19 14:18:11,869 WARN org.apache.hadoop.ipc.Server: IPC Server
 handler 4 on 9001, call
 heartbeat([EMAIL PROTECTED], false,
 true, 16077)

 from 205.188.170.84:54871: discarded for being too old (133957)

 2008-11-19 14:18:11,869 WARN org.apache.hadoop.ipc.Server: IPC Server
 handler 4 on 9001, call
 heartbeat([EMAIL PROTECTED], false,
 true, 16082)

 from 205.188.170.90:32934: discarded for being too old (133957)




 Thanks

 Pallavi




-- 
tp


Re: JobTracker Failing to respond with OutOfMemory error

2008-12-05 Thread charles du
I found the following error message in
hadoop-middleware-jobtracker-dd-9c32d01.off.tn.ask.com.out

   Java HotSpot(TM) Server VM warning: Exception java.lang.OutOfMemoryError
occurred dispatching signal SIGTERM to handler- the VM may need to be
forcibly terminated






-- 
tp


slow shuffle

2008-12-05 Thread Songting Chen
We encountered a bottleneck during the shuffle phase. However, there is not
much data to be shuffled across the network at all - less than 10 MB in total
(the combiner aggregated most of the data).

Are there any parameters or anything we can tune to improve the shuffle 
performance?

Thanks,
-Songting


getting Configuration object in mapper

2008-12-05 Thread abhinit
I have set some variables using the JobConf object:

jobConf.set("Operator", operator), etc.

How can I get an instance of the Configuration/JobConf object inside
a map method so that I can retrieve these variables?

Thanks
-Abhinit


Re: slow shuffle

2008-12-05 Thread Alex Loddengaard
These configuration options will be useful:

 <property>
   <name>mapred.job.shuffle.merge.percent</name>
   <value>0.66</value>
   <description>The usage threshold at which an in-memory merge will be
   initiated, expressed as a percentage of the total memory allocated to
   storing in-memory map outputs, as defined by
   mapred.job.shuffle.input.buffer.percent.
   </description>
 </property>

 <property>
   <name>mapred.job.shuffle.input.buffer.percent</name>
   <value>0.70</value>
   <description>The percentage of memory to be allocated from the maximum
   heap size to storing map outputs during the shuffle.
   </description>
 </property>

 <property>
   <name>mapred.job.reduce.input.buffer.percent</name>
   <value>0.0</value>
   <description>The percentage of memory, relative to the maximum heap size,
   to retain map outputs during the reduce. When the shuffle is concluded, any
   remaining map outputs in memory must consume less than this threshold before
   the reduce can begin.
   </description>
 </property>
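If it helps, these properties can also be overridden per job from the submitting code rather than in the site config file; a minimal sketch against the old JobConf API (the class name is illustrative, and the values shown are just the defaults quoted above, not tuning recommendations):

```java
import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static void main(String[] args) {
    // Sketch only: override the shuffle knobs quoted above on a per-job
    // basis. The values here are the defaults, not recommendations.
    JobConf conf = new JobConf(ShuffleTuning.class);
    conf.set("mapred.job.shuffle.merge.percent", "0.66");
    conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
    conf.set("mapred.job.reduce.input.buffer.percent", "0.0");
    // ... set input/output paths and mapper/reducer classes,
    // then submit with JobClient.runJob(conf)
  }
}
```

(Requires the Hadoop jars on the classpath.)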


How long did the shuffle take relative to the rest of the job?

Alex

On Fri, Dec 5, 2008 at 11:17 AM, Songting Chen [EMAIL PROTECTED]wrote:

 We encountered a bottleneck during the shuffle phase. However, there is not
 much data to be shuffled across the network at all - total less than
 10MBytes (the combiner aggregated most of the data).

 Are there any parameters or anything we can tune to improve the shuffle
 performance?

 Thanks,
 -Songting



stack trace from hung task

2008-12-05 Thread Sriram Rao
Hi,

When a task tracker kills a non-responsive task, it prints out the
message "Task X not reported status for 600 seconds. Killing!".
The stack trace it then dumps out is that of the task tracker itself.
Is there a way to get the hung task to dump its own stack trace before
exiting? It would be nice if there were an easy way to send a kill -3 to
the hung process and then kill it.

Sriram


Re: slow shuffle

2008-12-05 Thread Songting Chen
It takes 50% of the total time.




Re: stack trace from hung task

2008-12-05 Thread Ryan LeCompte
For what it's worth, I started seeing these when I upgraded to 0.19. I
was using 10 reduces, but changed it to 30 reduces for my job and now
I don't see these errors any more.

Thanks,
Ryan


On Fri, Dec 5, 2008 at 2:44 PM, Sriram Rao [EMAIL PROTECTED] wrote:
 Hi,

 When a task tracker kills a non-responsive task, it prints out a
 message Task X not reported status for 600 seconds. Killing!.
 The stack trace it then dumps out is that of the task tracker itself.
 Is there a way to get the hung task to dump out its stack trace before
 exiting?  Would be nice if there was an easy way to send a kill -3 to
 the hung process and then kill it.

 Sriram



Re: getting Configuration object in mapper

2008-12-05 Thread Owen O'Malley


On Dec 4, 2008, at 9:19 PM, abhinit wrote:


I have set some variables using the JobConf object:

jobConf.set("Operator", operator), etc.

How can I get an instance of the Configuration/JobConf object inside
a map method so that I can retrieve these variables?


In your Mapper class, implement a method like:
 public void configure(JobConf job) { ... }

This will be called when the object is created with the job conf.

-- Owen
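A minimal sketch of the pattern Owen describes, against the 0.17-era mapred API (the class name and the use made of the value are illustrative): configure(JobConf) is called once when the task starts, so the value set on the client can be cached in a field and read from map().

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OperatorMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String operator;

  // Called once with the job's configuration, before any map() calls.
  public void configure(JobConf job) {
    operator = job.get("Operator");  // value set via jobConf.set("Operator", ...)
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // "operator" is now available to the map logic.
    output.collect(new Text(operator), value);
  }
}
```

(Requires the Hadoop jars on the classpath.)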

Re: slow shuffle

2008-12-05 Thread Songting Chen
A little more information:

We optimized our Map process quite a bit, and now the shuffle has become the
bottleneck.

1. There are 300 map tasks (128 MB block size), each taking about 13 sec.
2. The reducers start running at a very late stage (when 80% of the maps are done).
3. Copying the 300 map outputs (the shuffle) takes as long as the entire map
phase, although each map output is only about 50 KB.







Re: getting Configuration object in mapper

2008-12-05 Thread Craig Macdonald
I have a related question: I have a class which is both a mapper and a
reducer. How can I tell in configure() whether the current task is a map or
a reduce task? Parse the task id?


C

Owen O'Malley wrote:


On Dec 4, 2008, at 9:19 PM, abhinit wrote:


I have set some variable using the JobConf object.

jobConf.set("Operator", operator), etc.

How can I get an instance of Configuration object/ JobConf object inside
a map method so that I can retrieve these variables.


In your Mapper class, implement a method like:
 public void configure(JobConf job) { ... }

This will be called when the object is created with the job conf.

-- Owen




Re: getting Configuration object in mapper

2008-12-05 Thread Sagar Naik

Check: mapred.task.is.map
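A sketch of how that might look in a combined mapper/reducer class (assuming, per Sagar's suggestion, that the framework sets mapred.task.is.map in each task's configuration; the class name is illustrative):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CombinedTask extends MapReduceBase {
  private boolean isMapTask;

  public void configure(JobConf job) {
    // mapred.task.is.map is assumed to be set by the framework per task.
    isMapTask = job.getBoolean("mapred.task.is.map", false);
    if (isMapTask) {
      // map-side setup
    } else {
      // reduce-side setup
    }
  }
}
```

(Requires the Hadoop jars on the classpath.)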

Craig Macdonald wrote:
I have a related question - I have a class which is both mapper and 
reducer. How can I tell in configure() if the current task is map or a 
reduce task? Parse the taskid?


C

Owen O'Malley wrote:


On Dec 4, 2008, at 9:19 PM, abhinit wrote:


I have set some variable using the JobConf object.

jobConf.set("Operator", operator), etc.

How can I get an instance of Configuration object/ JobConf object 
inside

a map method so that I can retrieve these variables.


In your Mapper class, implement a method like:
 public void configure(JobConf job) { ... }

This will be called when the object is created with the job conf.

-- Owen






Re: slow shuffle

2008-12-05 Thread Songting Chen
We have 4 testing data nodes with 3 reduce tasks. The parallel.copies
parameter has been increased to 20, 30, even 50, but it doesn't really help...


--- On Fri, 12/5/08, Aaron Kimball [EMAIL PROTECTED] wrote:

 From: Aaron Kimball [EMAIL PROTECTED]
 Subject: Re: slow shuffle
 To: core-user@hadoop.apache.org
 Date: Friday, December 5, 2008, 12:28 PM
 How many reduce tasks do you have? Look into increasing
 mapred.reduce.parallel.copies from the default of 5 to something more
 like 20 or 30.

 - Aaron


Re: slow shuffle

2008-12-05 Thread Songting Chen
I think one of the issues is that the reducer starts very late in the
process, slowing the entire job significantly.

Is there a way to let the reducer start earlier?



Re: slow shuffle

2008-12-05 Thread Songting Chen
To summarize the slow shuffle issue:

1. I think one problem is that the reducer starts very late in the process,
slowing the entire job significantly.

   Is there a way to let the reducer start earlier?

2. Copying 300 files of about 30 KB each took 3 minutes in total (after all
maps finished). This really puzzles me as to what is happening behind the
scenes. (Note that sorting takes < 1 sec.)

Thanks,
-Songting

 
 


File loss at Nebraska

2008-12-05 Thread Brian Bockelman
We are continuing to see a small, consistent amount of block  
corruption leading to file loss.  We have been upgrading our cluster  
lately, which means we've been doing a rolling de-commissioning of our  
nodes (and then adding them back with more disks!).


Previously, when I've had time to investigate this very deeply, I've  
found issues like these:


https://issues.apache.org/jira/browse/HADOOP-4692
https://issues.apache.org/jira/browse/HADOOP-4543

I suspect that this causes some or all of our problems.

I also saw that one of our nodes was at 100.2% full; I think this is  
due to the same issue; Hadoop's actual usage of the file system is  
greater than the max capacity because some of the blocks were truncated.


I'd have to check with our sysadmins, but I think we've lost about  
200-300 files during the upgrade process.  Right now, there are about  
900 chronically under-replicated blocks; in the past, that's meant the  
only replica is actually corrupt, and Hadoop is trying to relentlessly  
retransfer it, failing to, but not realizing the source is corrupt.   
To some extent, this whole issue is caused because we only have enough  
space for 2 replicas; I'd imagine that at 3 replicas, the issue would  
be much harder to trigger.


Any suggestions?  For us, file loss is something we can deal with (not  
necessarily fun to deal with, of course), but it might not be the case  
in the future.


Brian


Re: Issues with V0.19 upgrade

2008-12-05 Thread Michael Bieniosek
Not sure if anyone else answered...

1. You need to run hadoop dfsadmin -finalizeUpgrade.  Be careful, because you 
can't go back once you do this.

http://wiki.apache.org/hadoop/Hadoop_Upgrade

I don't know about 2.

-Michael

On 12/3/08 5:49 PM, Songting Chen [EMAIL PROTECTED] wrote:

1. The namenode webpage shows:

   Upgrades: Upgrade for version -18 has been completed.
   Upgrade is not finalized.

2. SequenceFile.Writer failed when trying to create a new file with the
following error. (We have two Hadoop clusters; both have issue 1; one has
issue 2, but the other is fine on issue 2.) Any idea what's going on?

Thanks,
-Songting

java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
at 
org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
at 
org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)





Block not found during commitBlockSynchronization

2008-12-05 Thread Brian Bockelman

Hey,

I'm seeing this message repeated over and over in my logs:

2008-12-05 19:20:00,534 INFO  
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:  
commitBlockSynchronization(lastblock=blk_-4236881263392665762_88597,  
newgenerationstamp=0, newlength=0, newtargets=[])
2008-12-05 19:20:00,534 INFO org.apache.hadoop.ipc.Server: IPC Server  
handler 29 on 9000, call  
commitBlockSynchronization(blk_-4236881263392665762_88597, 0, 0,  
false, true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@67537412)  
from 172.16.1.184:57586: error: java.io.IOException: Block  
(=blk_-4236881263392665762_88597) not found

java.io.IOException: Block (=blk_-4236881263392665762_88597) not found
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:1898)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:410)

at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

What can I do to debug?

Brian


Re: Block not found during commitBlockSynchronization

2008-12-05 Thread Tsz Wo (Nicholas), Sze
Which version are you using?

Calling commitBlockSynchronization(...) with newgenerationstamp=0, newlength=0, 
newtargets=[] does not look normal.  You may check the namenode log and the 
client log about the block blk_-4236881263392665762.

Nicholas Sze






Re: Block not found during commitBlockSynchronization

2008-12-05 Thread Brian Bockelman

This is 0.19.0.

Grepping around, it appears that the message for this block has been
printed at 1-5 Hz throughout all our logs (the oldest logs are from 12-3).
It has happened about 0.5 million times. If I grep for the
"nextGenerationStamp" error message, it has happened 0.4 million times.


Anything else I can provide?

Brian

On Dec 5, 2008, at 8:31 PM, Tsz Wo (Nicholas), Sze wrote:


Which version are you using?

Calling commitBlockSynchronization(...) with newgenerationstamp=0,  
newlength=0, newtargets=[] does not look normal.  You may check the  
namenode log and the client log about the block  
blk_-4236881263392665762.


Nicholas Sze



