On Tue, Mar 23, 2010 at 9:00 AM, Mohamed Riadh Trad <[email protected]> wrote:
> Hi,
>
> I am running Hadoop over a collection of several million small files
> using CombineFileInputFormat.
>
> However, when generating splits, the job fails with a GC overhead
> limit exceeded exception.
>
> I disabled the Garbage Collector overhead limit check with -server
> -XX:-UseGCOverheadLimit; I then get a java.lang.OutOfMemoryError: Java
> heap space even with -Xmx8192m -server.
>
> Is there any solution to avoid this limit when splitting input?

You can inherit directly from InputFormat and create your InputSplits and RecordReaders accordingly. With millions of small files, you can define a set of custom InputSplits based on higher-level logic. Your RecordReader will be somewhat cumbersome to write (the nextKeyValue() / getCurrentKey() / getCurrentValue() implementations), but at least you get better control over the behavior.

If the data lives on HDFS, though, you may want to rethink having such a large number of small files in the first place, or look at archiving options (e.g. Hadoop Archives or SequenceFiles) that keep your InputSplit / RecordReader implementations relatively simple.

> Regards
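To illustrate the "higher-level logic" part: a custom getSplits() would typically pack many small files into a bounded number of combined splits, rather than one split per file. Below is a minimal, self-contained sketch of that grouping step in plain Java. SplitGrouper, FileInfo, and the size threshold are hypothetical names for illustration, not Hadoop API; in a real InputFormat you would emit CombineFileSplit-like objects and also consider block locality.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: greedily pack small files into "splits" whose
// total size stays under a target, mirroring what a custom
// InputFormat.getSplits() could do for millions of tiny files.
public class SplitGrouper {

    static class FileInfo {
        final String path;
        final long length;
        FileInfo(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Pack files in order into groups of at most maxSplitBytes each.
    // A file larger than maxSplitBytes still gets its own group.
    static List<List<FileInfo>> groupIntoSplits(List<FileInfo> files,
                                                long maxSplitBytes) {
        List<List<FileInfo>> splits = new ArrayList<>();
        List<FileInfo> current = new ArrayList<>();
        long currentBytes = 0;
        for (FileInfo f : files) {
            if (!current.isEmpty() && currentBytes + f.length > maxSplitBytes) {
                splits.add(current);          // close the full group
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(f);
            currentBytes += f.length;
        }
        if (!current.isEmpty()) {
            splits.add(current);              // flush the last group
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 files of 40 bytes, 100-byte target -> 2 files per split.
        List<FileInfo> files = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            files.add(new FileInfo("part-" + i, 40));
        }
        List<List<FileInfo>> splits = groupIntoSplits(files, 100);
        System.out.println(splits.size()); // prints 5
    }
}
```

The key point for the memory problem above is that the number of split objects held in memory is now proportional to the number of groups, not the number of files; the per-file metadata can be streamed through the loop rather than materialized all at once.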
