Hi Hadoop Streaming users,

According to the Hadoop Streaming FAQ,
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
if I want to embarrassingly parallelize my task and assign one process per
map, I can use the "Hadoop Streaming and custom mapper script" approach:

   - Generate a file containing the full HDFS path of the input files. Each
   map task would get one file name as input.
   - Create a mapper script which, given a filename, will get the file to
   local disk, gzip the file and put it back in the desired output directory
   (a minimal sketch of such a script follows below)


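To make the discussion concrete, here is a minimal sketch of what I understand
such a mapper script to look like (Python; the output directory and the script
name gzip_mapper.py are placeholders I made up, not something the FAQ
specifies):

#!/usr/bin/env python
# Sketch of the FAQ's "custom mapper script": each input line carries one
# HDFS path; copy the file to local disk, gzip it, and put the result back.
import os
import subprocess
import sys

OUTPUT_DIR = "/user/qiming/gzipped"  # placeholder HDFS output directory

def run(cmd):
    # Let a non-zero exit code fail the map task.
    subprocess.check_call(cmd)

for line in sys.stdin:
    # Depending on the input format, a line may be "key<TAB>path" or just
    # the path; keep the last tab-separated field either way.
    hdfs_path = line.strip().split("\t")[-1]
    if not hdfs_path:
        continue
    local_name = os.path.basename(hdfs_path)

    run(["hadoop", "fs", "-get", hdfs_path, local_name])  # fetch to local disk
    run(["gzip", local_name])                             # creates local_name + ".gz"
    run(["hadoop", "fs", "-put", local_name + ".gz",
         OUTPUT_DIR + "/" + local_name + ".gz"])          # write back to HDFS

    print("%s\tgzipped" % hdfs_path)  # emit a status record
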
I understand that if I use a long list of full HDFS paths as the input file to
hadoop-streaming, that file will be chopped into pieces of roughly the same
size to feed each mapper.

The bigger question is: how can we make sure that the files listed in each
piece sit local to the mapper that processes them, i.e., preserve data
locality? If they are not local, the files will be fetched over the network,
which is definitely not scalable for data-intensive applications and defeats
the purpose of using Hadoop. I am questioning the rationale of this
"workaround" in the FAQ. Please advise.

Thanks

Qiming
