Could you pack them into .tar.bz2 archives (splitting the tar so that it
produces a few dozen files), put those onto HDFS, and use
http://stuartsierra.com/2008/04/24/a-million-little-files
to convert them into a SequenceFile?

This lets you preserve the originals and run the SequenceFile conversion
across the cluster. It's only really helpful, of course, if you also want to
prepare a .tar.bz2 anyway so you can clear out the sprawl.
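
As a rough sketch of the packing step (Python, standard library only): the
function name pack_into_archives, the part-NNNN.tar.bz2 naming, and the
default of 24 archives are my own assumptions for illustration, not anything
Hadoop or the linked post requires.

```python
import os
import tarfile


def pack_into_archives(src_dir, out_dir, num_archives=24):
    """Split the files under src_dir round-robin into .tar.bz2 archives."""
    # Collect every regular file, sorted so the split is reproducible.
    paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    paths.sort()

    os.makedirs(out_dir, exist_ok=True)
    # Never create more archives than there are files (but at least one).
    count = max(1, min(num_archives, len(paths)))
    archives = []
    for i in range(count):
        archive_path = os.path.join(out_dir, "part-%04d.tar.bz2" % i)
        with tarfile.open(archive_path, "w:bz2") as tar:
            # Every count-th file, starting at offset i, goes in this archive.
            for path in paths[i::count]:
                # Store paths relative to src_dir to keep the original layout.
                tar.add(path, arcname=os.path.relpath(path, src_dir))
        archives.append(archive_path)
    return archives
```

The resulting archives could then be copied onto HDFS (for example with
hadoop fs -put) before running the conversion described in the linked post.
Keep out_dir outside src_dir so a second run does not pack the old archives.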

flip

On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <[email protected]> wrote:

> Hi,
>
> I am writing an application to copy all files from a regular PC to a
> SequenceFile. I can surely do this by simply recursing through all
> directories on my PC, but I wonder if there is any way to parallelize
> this, a MapReduce task even. Tom White's book seems to imply that it will
> have to be a custom application.
>
> Thank you,
> Mark
>



-- 
http://www.infochimps.org
Connected Open Free Data
