Hi,

I have a job that processes raw data inside tarballs. As job input I have a
text file listing the full HDFS path of the files that need to be processed,
e.g.:
...
/user/eric/file451.tar.gz
/user/eric/file452.tar.gz
/user/eric/file453.tar.gz
...

Each mapper gets one line of input at a time, copies the tarball to local
storage, unpacks it, and processes all the files inside.
This works very well. However, chances are high that a mapper is assigned a
file that is not stored locally on its node, so the tarball has to be
transferred over the network first.
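For reference, the flow described above could be sketched roughly like this
as a Hadoop Streaming mapper in Python. This is only an illustration of the
pattern, not my actual code: `fetch_to_local` and `process_tarball` are
hypothetical names, and the per-file processing is left as a stub.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads one HDFS tarball path per
# input line, fetches it to local disk, unpacks it, processes the contents.
import os
import subprocess
import sys
import tarfile
import tempfile

def fetch_to_local(hdfs_path, local_dir):
    """Copy the tarball from HDFS to local disk.

    This is the step that becomes a network transfer whenever the block
    is not stored on the node running the mapper.
    """
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_dir])
    return os.path.join(local_dir, os.path.basename(hdfs_path))

def process_tarball(local_path):
    """Unpack the tarball and process every regular file inside."""
    processed = 0
    with tarfile.open(local_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                # ... real per-file processing would go here ...
                processed += 1
    return processed

if __name__ == "__main__":
    for line in sys.stdin:
        hdfs_path = line.strip()
        if not hdfs_path:
            continue
        tmp = tempfile.mkdtemp()
        local = fetch_to_local(hdfs_path, tmp)
        print("%s\t%d" % (hdfs_path, process_tarball(local)))
```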

My question: is there any way to get better locality in a job as described
above?

Best regards,
Eric
