Truly, I do not see any advantage to doing this over writing (Java) code that copies the files to HDFS, because then tarring becomes my bottleneck. I could write code to measure the file sizes and prepare pointers for multiple tarring tasks, but that gets pretty complex, and I was hoping for something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized.

Mark
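P.S. For the record, the "simple" approach I have in mind is roughly the sketch below: walk the local directory tree in a single process and append each file as one (path, bytes) record of a SequenceFile. Treat it as a rough outline rather than tested code; the class name, usage line, and the way I slurp file contents are just placeholders.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Single-process copy of a local directory tree into one SequenceFile on HDFS.
// Usage (illustrative): hadoop jar myjob.jar LocalDirToSequenceFile /local/source/dir /hdfs/target.seq
public class LocalDirToSequenceFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      append(new File(args[0]), writer);
    } finally {
      IOUtils.closeStream(writer);
    }
  }

  // Recurse over directories; each regular file becomes one (path, contents) record.
  private static void append(File f, SequenceFile.Writer writer) throws IOException {
    if (f.isDirectory()) {
      for (File child : f.listFiles()) {
        append(child, writer);
      }
      return;
    }
    byte[] contents = new byte[(int) f.length()];
    FileInputStream in = new FileInputStream(f);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      in.close();
    }
    writer.append(new Text(f.getPath()), new BytesWritable(contents));
  }
}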
On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer <f...@infochimps.org> wrote:

> Could you tar.bz2 them up (setting up the tar so that it made a few dozen
> files), toss them onto the HDFS, and use
> http://stuartsierra.com/2008/04/24/a-million-little-files
> to go into SequenceFile?
>
> This lets you preserve the originals and do the sequence file conversion
> across the cluster. It's only really helpful, of course, if you also want
> to prepare a .tar.bz2 so you can clear out the sprawl.
>
> flip
>
> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>
> > Hi,
> >
> > I am writing an application to copy all files from a regular PC to a
> > SequenceFile. I can surely do this by simply recursing all directories on
> > my PC, but I wonder if there is any way to parallelize this, perhaps even
> > as a MapReduce task. Tom White's book seems to imply that it will have to
> > be a custom application.
> >
> > Thank you,
> > Mark
>
> --
> http://www.infochimps.org
> Connected Open Free Data