Yes, option #2 should work well.
On 4/23/08 8:55 AM, "Stuart Sierra" <[EMAIL PROTECTED]> wrote:
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys. Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
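
For anyone else doing #2, here is a minimal sketch of a packer using the SequenceFile writer API (class name, argument order, and local-directory input are my own illustrative choices, not from Stuart's code): it walks a local directory, and appends each file to one BLOCK-compressed SequenceFile with the file name as the Text key and the raw bytes as a BytesWritable value.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

// Hypothetical example class; usage: SmallFilesToSequenceFile <inputDir> <outputFile>
public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[1]);

        // BLOCK compression groups many records per compressed block,
        // which suits lots of small values.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class,
                CompressionType.BLOCK);
        try {
            for (File f : new File(args[0]).listFiles()) {
                if (!f.isFile()) {
                    continue;
                }
                // Read the whole file into memory (fine for 10-100KB files).
                byte[] buf = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        int n = in.read(buf, off, buf.length - off);
                        if (n < 0) {
                            throw new IOException("unexpected EOF in " + f);
                        }
                        off += n;
                    }
                } finally {
                    in.close();
                }
                // Key = file name, value = raw file contents.
                writer.append(new Text(f.getName()), new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}
```

A map task can then consume the whole SequenceFile via SequenceFileInputFormat, so a million records become a handful of splits instead of a million map processes.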