Yes, option #2 should work well.
On 4/23/08 8:55 AM, "Stuart Sierra" <[EMAIL PROTECTED]> wrote:
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys. Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
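
For anyone else doing #2, here is a minimal sketch of a packer using the SequenceFile writer API (class name, argument order, and local-directory input are my own illustrative choices, not from Stuart's code): it walks a local directory, and appends each file to one BLOCK-compressed SequenceFile with the file name as the Text key and the raw bytes as a BytesWritable value.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

// Hypothetical example class; usage: SmallFilesToSequenceFile <inputDir> <outputFile>
public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[1]);

        // BLOCK compression groups many records per compressed block,
        // which suits lots of small values.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class,
                CompressionType.BLOCK);
        try {
            for (File f : new File(args[0]).listFiles()) {
                if (!f.isFile()) {
                    continue;
                }
                // Read the whole file into memory (fine for 10-100KB files).
                byte[] buf = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        int n = in.read(buf, off, buf.length - off);
                        if (n < 0) {
                            throw new IOException("unexpected EOF in " + f);
                        }
                        off += n;
                    }
                } finally {
                    in.close();
                }
                // Key = file name, value = raw file contents.
                writer.append(new Text(f.getName()), new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}
```

A map task can then consume the whole SequenceFile via SequenceFileInputFormat, so a million records become a handful of splits instead of a million map processes.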