On Fri, Aug 12, 2011 at 1:29 PM, W.P. McNeill <[email protected]> wrote:
> I have a large number of gzipped web server logs on NFS that I need to pull
> into HDFS for analysis by MapReduce. What is the most efficient way to do
> this?
>
> It seems like what I should do is:
>
> hadoop fs -copyFromLocal *.gz /my/HDFS/directory
>
> A couple of questions:
>
> 1. Is this single process, or will the files be copied up in parallel?
>
It will use a single process to do the copy. You could run multiple
-copyFromLocal or -moveFromLocal commands in parallel to improve speed
(a rough sketch is in the P.S. below).

> 2. Gzip is not a desirable compression format because it's not
> splittable. What's the best way to get these files into a better format?
> Should I run zcat > bzip before calling copyFromLocal or write a Hadoop
> job?
>
If you have LZO working, I would recommend it. Running MapReduce jobs on
LZO input was measurably quicker in my setup. While bzip2 provides better
compression ratios, it is far too CPU intensive compared to LZO/gzip. If
you have multiple gzip files, you might still be able to increase
parallelism by having a separate mapper run on each gzip file, though it
will still be limited to one mapper per file. I don't specifically recall
whether gzip or bzip2 was better in my case.

Sridhar
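
P.S. A rough, untested sketch of the parallel copy (assumes GNU xargs is
available; the NFS path and the parallelism level of 4 are just
placeholders):

  ls /nfs/logs/*.gz | xargs -P 4 -I{} hadoop fs -copyFromLocal {} /my/HDFS/directory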

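And if you go the LZO route, something along these lines would convert each
file while streaming it into HDFS, then index the results so they become
splittable (assumes the lzop CLI and the hadoop-lzo libraries are installed;
the paths and the jar name are placeholders):

  ls /nfs/logs/*.gz | xargs -P 4 -I{} sh -c \
      'zcat "$1" | lzop -c | hadoop fs -put - "/my/HDFS/directory/$(basename "$1" .gz).lzo"' _ {}

  # .lzo files are only splittable once they have been indexed:
  hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /my/HDFS/directory

None of this is tested here, so treat it as a starting point rather than a
recipe.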