Hi Hadoop Streaming users,

According to the Hadoop Streaming FAQ, http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions, if I want to embarrassingly parallelize my task and assign one process per map, I can follow the "Hadoop Streaming and custom mapper script" recipe:

- Generate a file containing the full HDFS paths of the input files. Each map task gets one file name as input.
- Create a mapper script which, given a file name, fetches the file to local disk, gzips it, and puts it back into the desired output directory.
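To make sure I am reading the recipe right, here is a minimal sketch of what I assume such a mapper script could look like in Python. The output directory /user/qiming/gzipped and the handling of an optional key field are my own assumptions, not something spelled out in the FAQ:

    #!/usr/bin/env python
    # Sketch of a streaming mapper: each stdin line names one HDFS file to
    # fetch, gzip on local disk, and put back into an assumed output directory.
    import os
    import subprocess
    import sys

    OUTPUT_DIR = "/user/qiming/gzipped"  # assumed HDFS output directory

    for line in sys.stdin:
        # Tolerate an optional key field: take the last tab-separated token.
        hdfs_path = line.strip().split("\t")[-1]
        if not hdfs_path:
            continue
        local_name = os.path.basename(hdfs_path)

        # Copy the file from HDFS to the task's local working directory.
        subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])

        # Compress it locally; gzip produces local_name + ".gz".
        subprocess.check_call(["gzip", local_name])

        # Upload the compressed file back to HDFS.
        subprocess.check_call(["hadoop", "fs", "-put", local_name + ".gz",
                               OUTPUT_DIR + "/" + local_name + ".gz"])

        # Emit a line so the framework records some map output.
        print("%s\tgzipped" % hdfs_path)

I would presumably submit this with the file-of-filenames as -input, the script as -mapper (shipped via -file), and zero reduce tasks.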
I understand that if I use a long list of full HDFS paths as the input file to hadoop-streaming, the list will be chopped into pieces of roughly the same size to feed each mapper. The bigger question is: how can we make sure that the files named in each piece sit local to the mapper that processes them, i.e., that we get data locality? If they do not, the files will be fetched over the network, which is definitely not scalable for a data-intensive application and defeats the purpose of using Hadoop. I am questioning the rationale of this "workaround" in the FAQ. Please advise.

Thanks,
Qiming
