I have about 24k .gz files (about 550GB total) on HDFS and a really simple Java program to convert them into SequenceFiles. If the job's setInputPaths() is given a Path[] of all 24k files, it hits an OutOfMemoryError at about 35% map completion. If I instead have each job process 2k files and run 12 jobs consecutively, it goes through all the files fine. The cluster I'm using has about 67 nodes; each node has 16GB of memory, a maximum of 7 map slots, and a maximum of 2 reduce slots.
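In case it helps, here's roughly what the batched driver does (the class name, output paths, and the file-listing step below are placeholders, not the actual code); the mapper itself is sketched after the next paragraph:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class GzToSeqDriver {
  public static void main(String[] args) throws IOException {
    // Placeholder: list all .gz files under the directory given as args[0].
    JobConf listConf = new JobConf(GzToSeqDriver.class);
    FileSystem fs = FileSystem.get(listConf);
    FileStatus[] statuses = fs.globStatus(new Path(args[0], "*.gz"));
    Path[] allInputs = new Path[statuses.length];
    for (int i = 0; i < statuses.length; i++) {
      allInputs[i] = statuses[i].getPath();
    }

    // Workaround: feed setInputPaths() only 2k files at a time and run the
    // jobs one after another instead of one job over all 24k paths.
    int batchSize = 2000;
    for (int start = 0; start < allInputs.length; start += batchSize) {
      Path[] batch = Arrays.copyOfRange(
          allInputs, start, Math.min(start + batchSize, allInputs.length));

      JobConf conf = new JobConf(GzToSeqDriver.class);
      conf.setJobName("gz-to-seq-" + start);
      conf.setMapperClass(GzToSeqMapper.class);
      conf.setNumReduceTasks(0);                  // map-only conversion
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(SequenceFileOutputFormat.class);

      FileInputFormat.setInputPaths(conf, batch); // only 2k paths per job
      FileOutputFormat.setOutputPath(conf, new Path("/output/batch-" + start));

      JobClient.runJob(conf);                     // run batches consecutively
    }
  }
}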
The map task is really simple: it takes a LongWritable key and a Text value, generates a Text newKey, and calls output.collect(newKey, value). There's no code in it that could plausibly leak memory. For the vast majority of the OutOfMemoryErrors there's no stack trace, just a single line in the log like this:

2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker: java.lang.OutOfMemoryError: Java heap space

I can't find the stack trace right now, but occasionally the OutOfMemoryError originates from some Hadoop config array-copy operation. There's no special configuration for the job.
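For completeness, the mapper is roughly the following sketch (the class name and the key derivation are placeholders; the real newKey is computed however the conversion needs):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GzToSeqMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Text newKey = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Placeholder key derivation: use the first tab-separated field of the
    // line as the new key (the actual logic is whatever the conversion needs).
    String line = value.toString();
    int tab = line.indexOf('\t');
    newKey.set(tab >= 0 ? line.substring(0, tab) : line);

    // Emit (newKey, value); no per-record state is retained between calls.
    output.collect(newKey, value);
  }
}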