I think there's no way to avoid this limit if you have several million small files.
You might know that at least one InputSplit instance is created per file, so with several million small files you end up with several million InputSplit instances, and that alone can consume many gigabytes of memory during split generation. A workaround is to tar the small files and implement an InputFormat that reads those tar archives.
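Something along these lines might work as a starting point (just a rough, untested sketch, not anyone's production code; it assumes the new org.apache.hadoop.mapreduce API and Apache Commons Compress on the classpath for the tar parsing, and the class names are only illustrative):

import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Emits one (file name, file contents) record per entry of a tar archive. */
public class TarFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // One split per archive, so the number of splits tracks the number of
    // tar files instead of the number of small files packed inside them.
    return false;
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new TarRecordReader();
  }

  public static class TarRecordReader extends RecordReader<Text, BytesWritable> {
    private TarArchiveInputStream tarIn;
    private final Text key = new Text();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException {
      Path path = ((FileSplit) split).getPath();
      FileSystem fs = path.getFileSystem(context.getConfiguration());
      FSDataInputStream in = fs.open(path);
      tarIn = new TarArchiveInputStream(in);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      TarArchiveEntry entry;
      while ((entry = tarIn.getNextTarEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        // The files are small, so reading each entry fully into memory is fine.
        byte[] buf = new byte[(int) entry.getSize()];
        int off = 0;
        while (off < buf.length) {
          int n = tarIn.read(buf, off, buf.length - off);
          if (n < 0) {
            throw new IOException("Unexpected end of entry " + entry.getName());
          }
          off += n;
        }
        key.set(entry.getName());
        value.set(buf, 0, buf.length);
        return true;
      }
      return false;   // no more entries in this archive
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return 0.0f; }   // good enough for a sketch

    @Override
    public void close() throws IOException {
      if (tarIn != null) {
        tarIn.close();
      }
    }
  }
}

You would then set it with job.setInputFormatClass(TarFileInputFormat.class) and point the job at a directory of tar archives instead of the raw small files, so split generation only has to account for the archives.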
One of my colleagues had a similar case with a large number of small files: he tarred them, wrote an InputFormat for the archives along those lines, and that solved the problem. I would like to suggest that he open-source it, but I am not sure his manager would permit it :)

2010/3/24 Mohamed Riadh Trad <[email protected]>:
> Hi,
>
> I am running Hadoop over a collection of several million small files
> using the CombineFileInputFormat.
>
> However, when generating splits, the job fails because of a Garbage Collector
> overhead limit exceeded exception.
>
> I disabled the Garbage Collector overhead limit check with -server
> -XX:-UseGCOverheadLimit; I get a java.lang.OutOfMemoryError: Java heap space
> even with -Xmx8192m -server.
>
> Is there any solution to avoid this limit when splitting input?
>
> Regards
