Hey, thanks Joey for that information. I will work on what you said.

Regards
Naveen Mahale

On Wed, Sep 14, 2011 at 5:32 PM, Joey Echeverria <[email protected]> wrote:

> Hi Naveen,
>
> > I use hadoop-0.21.0 distribution. I have a large number of small files
> (KB).
>
> Word of warning, 0.21 is not a stable release. The recommended version
> is in the 0.20.x range.
>
> > Is there any efficient way of handling it in hadoop?
> >
> > I have heard that solution for that problem is using:
> >            1. HAR (hadoop archives)
> >            2. cat on files
> >
> > I would like to know if there are any other solutions for processing
> large
> > number of small files.
>
> You could also stick each file as a record in a sequence file. The
> name of the file becomes the key, the bytes of the file the value.
> That gives you compression and splittability, but not random access.
> You already noted HAR, which does give you random access.
>
> -Joey
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>
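Joey's sequence-file suggestion boils down to packing each small file into one large container as a (filename, bytes) record. Below is a minimal Python sketch of that idea; it is not the Hadoop SequenceFile API, just a hypothetical length-prefixed, compressed container to illustrate why packing many small files into one large file avoids the per-file overhead:

```python
import gzip
import io
import struct

# Conceptual sketch of "one record per small file": key = file name,
# value = raw bytes. This is NOT Hadoop's SequenceFile format -- just a
# minimal illustration of the packing idea from the thread.

def pack(records):
    """Pack (name, data) pairs into one gzip-compressed blob."""
    buf = io.BytesIO()
    for name, data in records:
        key = name.encode("utf-8")
        # Length-prefix the key and value so records can be read back.
        buf.write(struct.pack(">II", len(key), len(data)))
        buf.write(key)
        buf.write(data)
    return gzip.compress(buf.getvalue())

def unpack(blob):
    """Yield the records back out, in order (sequential access only)."""
    buf = io.BytesIO(gzip.decompress(blob))
    while True:
        header = buf.read(8)
        if not header:
            break
        klen, vlen = struct.unpack(">II", header)
        name = buf.read(klen).decode("utf-8")
        data = buf.read(vlen)
        yield name, data

# Example: three "small files" become one compressed container.
files = [("a.txt", b"alpha"), ("b.txt", b"beta"), ("c.txt", b"gamma")]
blob = pack(files)
assert list(unpack(blob)) == files
```

As Joey notes, a real SequenceFile gives you compression and splittability this way, but reading a particular file means scanning records sequentially; a HAR archive trades some of that for random access by name.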
