I have a large number of gzipped web server logs on NFS that I need to pull into HDFS for analysis by MapReduce. What is the most efficient way to do this?
It seems like what I should do is:

    hadoop fs -copyFromLocal *.gz /my/HDFS/directory

A couple of questions:

1. Is this a single process, or will the files be copied up in parallel?
2. Gzip is not a desirable compression format because it isn't splittable. What's the best way to get these files into a better format? Should I pipe zcat into bzip2 before calling copyFromLocal (roughly as sketched below), or write a Hadoop job to do the conversion?
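Here is a minimal sketch of the pre-conversion approach I have in mind, assuming zcat, bzip2, GNU xargs, and the hadoop CLI are all on the PATH, and using /my/HDFS/directory as the target from the command above:

    #!/usr/bin/env bash
    # Sketch: recompress each gzipped log to bzip2 (splittable) and stream it
    # straight into HDFS, a few files at a time, without a temporary local file.
    HDFS_DIR=/my/HDFS/directory

    recompress_one() {
      local src=$1
      local base
      base=$(basename "$src" .gz)
      # zcat decompresses, bzip2 recompresses, and "hadoop fs -put -" reads
      # the recompressed stream from stdin and writes it into HDFS.
      zcat "$src" | bzip2 -c | hadoop fs -put - "$HDFS_DIR/$base.bz2"
    }
    export -f recompress_one
    export HDFS_DIR

    # Run up to 4 conversions at a time (parallelism via GNU xargs -P).
    printf '%s\0' *.gz | xargs -0 -n1 -P4 bash -c 'recompress_one "$0"'

Is something like this reasonable, or is a MapReduce job the better tool for the recompression step?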
