I have about 24k gz files (about 550GB total) on HDFS and a really simple
Java program to convert them into sequence files.  If the job's
setInputPaths takes a Path[] of all 24k files, it gets an OutOfMemory
error at about 35% map completion.  If I instead have the program process
2k files per job and run 12 jobs consecutively, it goes through all the
files fine.  The cluster I'm using has about 67 nodes; each node has 16GB
of memory, a maximum of 7 map slots, and a maximum of 2 reduce slots.
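
For reference, the batched version of the driver is roughly the sketch
below (the 2k batch size is what I actually run; the class names and the
output layout are just placeholders, and it uses the old
org.apache.hadoop.mapred API):

// Split the ~24k input paths into consecutive jobs of 2k files each.
import java.util.Arrays;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class GzToSeqDriver {
  public static void runBatches(Path[] allFiles) throws Exception {
    int batchSize = 2000;                          // 2k files per job
    for (int start = 0; start < allFiles.length; start += batchSize) {
      int end = Math.min(start + batchSize, allFiles.length);
      Path[] batch = Arrays.copyOfRange(allFiles, start, end);

      JobConf conf = new JobConf(GzToSeqDriver.class);
      conf.setJobName("gz-to-seq-" + start);
      conf.setMapperClass(GzToSeqMapper.class);    // mapper shown further down
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setOutputFormat(SequenceFileOutputFormat.class);

      FileInputFormat.setInputPaths(conf, batch);  // only this batch's paths
      FileOutputFormat.setOutputPath(conf, new Path("/seq-out/batch-" + start));

      JobClient.runJob(conf);                      // run jobs one after another
    }
  }
}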

The map task is really simple: it takes a LongWritable key and a Text
value, generates a Text newKey, and calls output.collect(newKey, value).
It doesn't have any code that could plausibly leak memory.
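
The mapper is essentially the following (how newKey is actually derived
isn't important here, so using the input file name below is just a
stand-in):

// Mapper sketch, old org.apache.hadoop.mapred API.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GzToSeqMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String fileName;

  public void configure(JobConf job) {
    // map.input.file holds the path of the split being processed
    fileName = job.get("map.input.file");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Placeholder key; the real newKey logic is equally trivial
    Text newKey = new Text(fileName + ":" + key.get());
    output.collect(newKey, value);
  }
}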

There's no stack trace for the vast majority of the OutOfMemory errors;
there's just a single line in the log like this:

2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
java.lang.OutOfMemoryError: Java heap space

I can't find the stack trace right now, but in the rare cases where one
does show up, the OutOfMemory error originates from some Hadoop config
array-copy operation.  There's nothing special in the job's configuration.