Hi, I have a job that processes raw data inside tarballs. As job input I have a text file listing the full HDFS paths of the files that need to be processed, e.g.:

...
/user/eric/file451.tar.gz
/user/eric/file452.tar.gz
/user/eric/file453.tar.gz
...
Each mapper gets one line of input at a time, moves the tarball to local storage, unpacks it, and processes all the files inside. This works very well. However, chances are high that a mapper is assigned a file that is not stored locally on its node, so the tarball first has to be transferred over the network. My question: is there any way to get better data locality in a job like the one described above?

Best regards,
Eric
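For reference, the per-mapper workflow described above could be sketched roughly as follows, assuming a Hadoop Streaming job with a Python mapper. The helper names (`fetch_to_local`, `process_tarball`) and the use of the `hdfs dfs -copyToLocal` CLI for the fetch step are illustrative assumptions, not part of the original job:

```python
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path


def fetch_to_local(hdfs_path: str, local_dir: str) -> Path:
    """Copy one tarball from HDFS to local disk.

    Assumes the `hdfs` CLI is on the PATH; this is where the network
    transfer happens whenever the blocks are not on the local node.
    """
    subprocess.run(
        ["hdfs", "dfs", "-copyToLocal", hdfs_path, local_dir],
        check=True,
    )
    return Path(local_dir) / Path(hdfs_path).name


def process_tarball(local_tarball: Path) -> list[str]:
    """Unpack the tarball into a scratch directory and visit each member file.

    Here 'processing' only records the member names; the real per-file
    work would go inside the loop.
    """
    processed = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(local_tarball, "r:gz") as tar:
            tar.extractall(scratch)
            for member in tar.getmembers():
                if member.isfile():
                    # ... real per-file processing would happen here ...
                    processed.append(member.name)
    return processed


if __name__ == "__main__":
    # Hadoop Streaming hands each mapper its input lines on stdin,
    # one HDFS tarball path per line.
    with tempfile.TemporaryDirectory() as local_dir:
        for line in sys.stdin:
            hdfs_path = line.strip()
            if not hdfs_path:
                continue
            tarball = fetch_to_local(hdfs_path, local_dir)
            for name in process_tarball(tarball):
                print(f"{name}\t1")
```

The locality problem is visible in `fetch_to_local`: the job's input split is the text line containing the path, not the tarball itself, so the scheduler has no idea which node holds the tarball's blocks.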
