Hi Naveen,

> I use hadoop-0.21.0 distribution. I have a large number of small files (KB).
Word of warning: 0.21 is not a stable release. The recommended version is in the 0.20.x range.

> Is there any efficient way of handling it in hadoop?
>
> I have heard that solution for that problem is using:
> 1. HAR (hadoop archives)
> 2. cat on files
>
> I would like to know if there are any other solutions for processing large number of small files.

You could also store each file as a record in a SequenceFile. The name of the file becomes the key, and the bytes of the file become the value. That gives you compression and splittability, but not random access. You already noted HAR, which does give you random access.

-Joey

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
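
P.S. Here's a rough, untested sketch of the SequenceFile packing I described, written against the 0.20.x API. The class name and the command-line arguments (an input directory of small files and an output path) are just placeholders, so adjust for your setup.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper: packs every file in a directory into one SequenceFile.
// Usage: hadoop SmallFilesToSequenceFile <input dir> <output sequence file>
public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);
    Path outputFile = new Path(args[1]);

    // Block compression compresses runs of records together, which works
    // well when the individual records (files) are small.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outputFile, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) {
          continue; // this simple sketch ignores subdirectories
        }
        // Read the whole file into memory; fine for KB-sized files.
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(contents);
        } finally {
          IOUtils.closeStream(in);
        }
        // Key = file name, value = the raw bytes of the file.
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}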
