I am interested in a few things, all pertaining to HDFS block locations for running map tasks. I have spent several days looking through the Hadoop source code and still have a couple of questions that I have not been able to answer.

1) When the JobTracker assigns a task to a TaskTracker, it determines whether the task is data-local or rack-local from the splits (which were generated during the job init process). Where in the code could I "refresh" the split locations in case they have changed, or in case blocks have since been replicated to additional datanodes?

2) When a TaskTracker is assigned a map task, is it told whether the task is data-local or rack-local? If so, where in the code does this take place? And would it be possible to patch the code so that the TaskTracker first checks whether it holds a local copy of the block before going over the network to fetch the block from another datanode?

Thanks in advance for your time.

Regards,
Mike Cardosa
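
P.S. To make question 1 more concrete, here is a rough sketch of the locality decision as I understand it. This is not Hadoop code: the class, the method, the host names, and the rack paths are all made up for illustration. It only shows why a stale host list in the split matters to me.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT taken from the Hadoop source: classifies a map
// task's locality from a split's host list and a host -> rack mapping.
class LocalitySketch {
    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    static Locality classify(String trackerHost,
                             List<String> splitHosts,
                             Map<String, String> rackOf) {
        // Data-local: the tracker's own host holds a replica of the block.
        if (splitHosts.contains(trackerHost)) {
            return Locality.DATA_LOCAL;
        }
        // Rack-local: some replica lives on a host in the tracker's rack.
        String trackerRack = rackOf.get(trackerHost);
        if (trackerRack != null) {
            for (String host : splitHosts) {
                if (trackerRack.equals(rackOf.get(host))) {
                    return Locality.RACK_LOCAL;
                }
            }
        }
        // Otherwise the block has to come from another rack.
        return Locality.OFF_RACK;
    }

    public static void main(String[] args) {
        // Made-up topology: node1 and node2 share a rack, node3 is alone.
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("node1", "/rack1");
        rackOf.put("node2", "/rack1");
        rackOf.put("node3", "/rack2");

        // Host list recorded when the splits were computed at job init.
        List<String> staleHosts = Arrays.asList("node1");
        // Host list after the block was also replicated to node3.
        List<String> refreshedHosts = Arrays.asList("node1", "node3");

        System.out.println(classify("node3", staleHosts, rackOf));
        System.out.println(classify("node3", refreshedHosts, rackOf));
    }
}
```

With the stale host list, the task on node3 is classified as off-rack, even though node3 holds a replica by the time the task is assigned. With the refreshed list, the same task would be data-local. That gap is what I would like to close.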