Hey, thanks Joey for that information. I will work on what you said.

Regards,
Naveen Mahale
On Wed, Sep 14, 2011 at 5:32 PM, Joey Echeverria <[email protected]> wrote:
> Hi Naveen,
>
> > I use hadoop-0.21.0 distribution. I have a large number of small files (KB).
>
> Word of warning, 0.21 is not a stable release. The recommended version
> is in the 0.20.x range.
>
> > Is there any efficient way of handling it in hadoop?
> >
> > I have heard that solution for that problem is using:
> > 1. HAR (hadoop archives)
> > 2. cat on files
> >
> > I would like to know if there are any other solutions for processing large
> > number of small files.
>
> You could also stick each file as a record in a sequence file. The
> name of the file becomes the key, the bytes of the file the value.
> That gives you compression and splitability, but not random access.
> You already noted HAR, which does give you random access.
>
> -Joey
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
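For anyone finding this thread later, Joey's sequence-file suggestion can be sketched roughly as follows. This is a minimal, untested sketch against the 0.20-era SequenceFile API; the class name, output path, and input handling are made up for illustration, and you would run it with the Hadoop jars on the classpath.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical packer: writes each small local file as one record in a
// SequenceFile, with the file name as the key and its bytes as the value.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]); // e.g. a target .seq path on HDFS

        // BLOCK compression groups many small records together, which is
        // what gives the compression benefit Joey mentions.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            for (int i = 1; i < args.length; i++) {
                File f = new File(args[i]);
                byte[] bytes = Files.readAllBytes(f.toPath());
                // key = file name, value = raw file contents
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

As Joey notes, the resulting file is splittable and compressed, so MapReduce jobs read it efficiently, but you lose random access to individual files; a HAR archive keeps random access at the cost of the packing step.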
