On Tue, Mar 23, 2010 at 9:00 AM, Mohamed Riadh Trad
<[email protected]>wrote:

> Hi,
>
> I am running hadoop over a collection of several millions of small files
> using the CombineFileInputFormat.
>
> However, when generating splits, the job fails with a "GC overhead
> limit exceeded" exception.
>
> I disabled the Garbage Collector overhead limit check with -server
> -XX:-UseGCOverheadLimit; I then get a java.lang.OutOfMemoryError: Java heap
> space with -Xmx8192m -server.
>
> Is there any solution to avoid this limit when splitting input?
>

You can inherit directly from InputFormat and create your own
InputSplits / RecordReaders accordingly.

With millions of small files, you can define a set of custom
InputSplits based on higher-level logic. Your RecordReader will be somewhat
cumbersome to implement (nextKeyValue() / getCurrentKey() /
getCurrentValue()), but at least you get better control over the behavior.
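A minimal skeleton of that approach might look like the following, assuming
the new (org.apache.hadoop.mapreduce) API. All class names here are
illustrative placeholders, not something from this thread:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical InputFormat that groups many small files into a few
// coarse splits, so split generation stays cheap on the client side.
public class ManyFilesInputFormat extends InputFormat<Text, BytesWritable> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        // Build a small number of custom InputSplits here, using whatever
        // higher-level grouping fits your data (directory, date, size, ...).
        return splits;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ManyFilesRecordReader();
    }

    // Stub RecordReader: this is where the cumbersome part lives.
    public static class ManyFilesRecordReader
            extends RecordReader<Text, BytesWritable> {

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            // Open the first file covered by this split.
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // Advance to the next small file; return false when exhausted.
            return false;
        }

        @Override
        public Text getCurrentKey() {
            return null; // e.g. the file name
        }

        @Override
        public BytesWritable getCurrentValue() {
            return null; // e.g. the file contents
        }

        @Override
        public float getProgress() {
            return 0.0f;
        }

        @Override
        public void close() {
        }
    }
}
```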

But if the data lives on HDFS, you may want to rethink storing a large
number of small files in the first place, or look at archiving options that
keep your InputSplit / RecordReader implementations relatively simple.
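One common archiving option is to pack the small files into a single
SequenceFile keyed by file name. A rough sketch (paths and names here are
made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a directory of small files into one SequenceFile
// (key = file name, value = file bytes).
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);   // e.g. /user/you/small-files
        Path out = new Path(args[1]);  // e.g. /user/you/packed.seq

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (FileStatus st : fs.listStatus(in)) {
                byte[] buf = new byte[(int) st.getLen()];
                FSDataInputStream stream = fs.open(st.getPath());
                try {
                    stream.readFully(buf); // slurp the whole small file
                } finally {
                    stream.close();
                }
                writer.append(new Text(st.getPath().getName()),
                              new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}
```

After packing, a single SequenceFileInputFormat job reads the archive with
far fewer splits, so split generation no longer blows the client heap.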

>
> Regards
>
>
>
>
