I have a large number of gzipped web server logs on NFS that I need to pull
into HDFS for analysis by MapReduce.  What is the most efficient way to do
this?

It seems like what I should do is:

hadoop fs -copyFromLocal *.gz /my/HDFS/directory

A couple of questions:

   1. Is this a single process, or will the files be copied up in parallel?
   (A client-side workaround is sketched below.)
   2. Gzip is not a desirable compression format because it's not
   splittable. What's the best way to get these files into a better format?
   Should I pipe them through zcat | bzip2 before calling copyFromLocal, or
   write a Hadoop job? (A streaming sketch is also below.)
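
If the copy does turn out to be a single client process, one workaround I'm
considering is fanning the upload out across several hadoop fs -put
processes on the client. A rough sketch, assuming GNU xargs and placeholder
paths:

# Upload one file per hadoop fs -put, up to 8 client processes in parallel
# (GNU xargs -P). /nfs/logs and /my/HDFS/directory are placeholder paths.
find /nfs/logs -name '*.gz' -print0 | \
    xargs -0 -P 8 -I{} hadoop fs -put {} /my/HDFS/directory/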
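
For the second question, the zcat route I had in mind would recompress on
the client and stream straight into HDFS, avoiding an intermediate local
copy. A rough sketch, assuming hadoop fs -put accepts - for stdin and the
same placeholder paths:

# Recompress each gzipped log to bzip2 on the fly and write it into HDFS.
# Assumes `hadoop fs -put - <dst>` reads the file contents from stdin.
for f in /nfs/logs/*.gz; do
    zcat "$f" | bzip2 -c | \
        hadoop fs -put - "/my/HDFS/directory/$(basename "$f" .gz).bz2"
done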
