I have a large number of gzipped web server logs on NFS that I need to pull into HDFS for analysis by MapReduce. What is the most efficient way to do this?
It seems like what I should do is:

    hadoop fs -copyFromLocal *.gz /my/HDFS/directory

A couple of questions:

1. Is this a single process, or will the files be copied up in parallel?
2. Gzip is not a desirable compression format because it isn't splittable. What's the best way to get these files into a better format? Should I pipe zcat into bzip2 before calling copyFromLocal (roughly as sketched below), or write a Hadoop job to do the conversion?
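Here is a minimal sketch of the pre-conversion approach I have in mind, assuming zcat, bzip2, GNU xargs, and the hadoop CLI are all on the PATH, and using /my/HDFS/directory as the target from the command above:

    #!/usr/bin/env bash
    # Sketch: recompress each gzipped log to bzip2 (splittable) and stream it
    # straight into HDFS, a few files at a time, without a temporary local file.
    HDFS_DIR=/my/HDFS/directory

    recompress_one() {
      local src=$1
      local base
      base=$(basename "$src" .gz)
      # zcat decompresses, bzip2 recompresses, and "hadoop fs -put -" reads
      # the recompressed stream from stdin and writes it into HDFS.
      zcat "$src" | bzip2 -c | hadoop fs -put - "$HDFS_DIR/$base.bz2"
    }
    export -f recompress_one
    export HDFS_DIR

    # Run up to 4 conversions at a time (parallelism via GNU xargs -P).
    printf '%s\0' *.gz | xargs -0 -n1 -P4 bash -c 'recompress_one "$0"'

Is something like this reasonable, or is a MapReduce job the better tool for the recompression step?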
