Truly, I do not see any advantage to doing this over writing (Java) code that copies the files to HDFS, because then tarring becomes my bottleneck. I could write code to measure the file sizes and prepare pointers for multiple tarring tasks, but that gets pretty complex, and I was hoping for something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized.

Mark
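P.S. For the record, the "simple" approach I have in mind is roughly the sketch below: walk the local directory tree in a single process and append each file as one (path, bytes) record of a SequenceFile. Treat it as a rough outline rather than tested code; the class name, usage line, and the way I slurp file contents are just placeholders.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Single-process copy of a local directory tree into one SequenceFile on HDFS.
// Usage (illustrative): hadoop jar myjob.jar LocalDirToSequenceFile /local/source/dir /hdfs/target.seq
public class LocalDirToSequenceFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      append(new File(args[0]), writer);
    } finally {
      IOUtils.closeStream(writer);
    }
  }

  // Recurse over directories; each regular file becomes one (path, contents) record.
  private static void append(File f, SequenceFile.Writer writer) throws IOException {
    if (f.isDirectory()) {
      for (File child : f.listFiles()) {
        append(child, writer);
      }
      return;
    }
    byte[] contents = new byte[(int) f.length()];
    FileInputStream in = new FileInputStream(f);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      in.close();
    }
    writer.append(new Text(f.getPath()), new BytesWritable(contents));
  }
}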
On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer <f...@infochimps.org> wrote:

> Could you tar.bz2 them up (setting up the tar so that it made a few dozen
> files), toss them onto the HDFS, and use
> http://stuartsierra.com/2008/04/24/a-million-little-files
> to go into SequenceFile?
>
> This lets you preserve the originals and do the sequence file conversion
> across the cluster. It's only really helpful, of course, if you also want
> to prepare a .tar.bz2 so you can clear out the sprawl.
>
> flip
>
> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>
> > Hi,
> >
> > I am writing an application to copy all files from a regular PC to a
> > SequenceFile. I can surely do this by simply recursing all directories on
> > my PC, but I wonder if there is any way to parallelize this, perhaps even
> > as a MapReduce task. Tom White's book seems to imply that it will have to
> > be a custom application.
> >
> > Thank you,
> > Mark
>
> --
> http://www.infochimps.org
> Connected Open Free Data