Could you tar.bz2 them up (splitting the files so that you end up with a few dozen archives), toss them onto HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to convert them into a SequenceFile?
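A rough sketch of the packing step, in Python and using only the standard library. The function name, the archive naming scheme, and the default of 24 archives are my own placeholders, not anything the linked tool requires; it just deals the files round-robin into a fixed number of .tar.bz2 archives.

```python
import os
import tarfile


def pack_into_archives(src_dir, out_dir, num_archives=24):
    """Deal the files under src_dir round-robin into up to num_archives .tar.bz2 files.

    Hypothetical helper: the name, the defaults, and the part-NNN naming are
    illustrative only. Keep out_dir outside src_dir so that archives from an
    earlier run are not packed again.
    """
    # Collect every non-directory entry, sorted so the split is reproducible.
    paths = sorted(
        os.path.join(root, name)
        for root, _dirs, names in os.walk(src_dir)
        for name in names
    )
    os.makedirs(out_dir, exist_ok=True)

    written = []
    for i in range(num_archives):
        chunk = paths[i::num_archives]
        if not chunk:
            continue  # fewer files than archives, nothing to write
        archive = os.path.join(out_dir, "part-%03d.tar.bz2" % i)
        with tarfile.open(archive, "w:bz2") as tar:
            for path in chunk:
                # Store paths relative to src_dir to keep the original layout.
                tar.add(path, arcname=os.path.relpath(path, src_dir))
        written.append(archive)
    return written
```

The resulting archives can then be copied up with `hadoop fs -put` and handed to the converter from the link above.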
This lets you preserve the originals and do the SequenceFile conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl.

flip

On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <[email protected]> wrote:
> Hi,
>
> I am writing an application to copy all files from a regular PC to a
> SequenceFile. I can surely do this by simply recursing all directories on
> my PC, but I wonder if there is any way to parallelize this, a MapReduce
> task even. Tom White's book seems to imply that it will have to be a custom
> application.
>
> Thank you,
> Mark

--
http://www.infochimps.org
Connected Open Free Data
