Moving to mapreduce-dev@ (bcc common-dev@). Responses inline:
On Mar 29, 2010, at 7:02 PM, Mike Cardosa wrote:
1) When the jobtracker assigns a task to a tasktracker, it determines if the task is data-local or rack-local from the splits (which were generated during the job init process). Where in the code could I "refresh" the split locations in case they have changed or blocks have been replicated to additional new datanodes?
There is no easy way to do that. In practice, though, I don't think it matters much.
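
For reference, the locality check described in the question boils down to comparing the tasktracker's host and rack against the hosts recorded in the split at job init. Below is a minimal standalone sketch of that idea, not the actual JobTracker code: the class SplitLocality, the method classify and the host-to-rack map are all hypothetical names made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not code from the JobTracker.
class SplitLocality {
    enum Level { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    // splitHosts: the hosts recorded in the split when it was generated.
    // rackOf: an assumed host -> rack mapping (in a real cluster this would
    // come from the topology configuration).
    static Level classify(String trackerHost, List<String> splitHosts,
                          Map<String, String> rackOf) {
        if (splitHosts.contains(trackerHost)) {
            return Level.DATA_LOCAL;
        }
        String trackerRack = rackOf.get(trackerHost);
        if (trackerRack != null) {
            for (String host : splitHosts) {
                if (trackerRack.equals(rackOf.get(host))) {
                    return Level.RACK_LOCAL;
                }
            }
        }
        return Level.OFF_RACK;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("host1", "/rack1");
        rackOf.put("host2", "/rack1");
        rackOf.put("host3", "/rack2");
        rackOf.put("host4", "/rack3");

        // The split was recorded with replicas on host1 and host3.
        List<String> splitHosts = Arrays.asList("host1", "host3");
        for (String tracker : Arrays.asList("host1", "host2", "host4")) {
            System.out.println(tracker + ": " + classify(tracker, splitHosts, rackOf));
        }
    }
}
```

Because the host list is fixed when the split is created, a replica added later simply does not show up in this check; the task is still scheduled, it may just be labelled rack-local when it could have been data-local.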
2) When a tasktracker is assigned a map task, is it informed if it's a data-local or rack-local map task? If so, where in the code does this take place, and is it possible to patch the code to have it check to see if it has a data-local copy of the block first before going to the network to download the block from another datanode?
No, the TT neither knows nor cares. The DFSClient used by the map task is smart enough to do its i/o from the 'nearest' datanode.
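
To make 'nearest' concrete, here is a rough standalone sketch of the idea: prefer a replica on the reader's own host, then one on the same rack, then anything else. This is not the DFSClient code; the class NearestReplica, its methods, the rack map and the distance values 0/2/4 are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the DFSClient implementation.
class NearestReplica {
    // Illustrative distances: same host < same rack < different rack.
    static int distance(String reader, String replica, Map<String, String> rackOf) {
        if (reader.equals(replica)) {
            return 0;
        }
        String readerRack = rackOf.get(reader);
        if (readerRack != null && readerRack.equals(rackOf.get(replica))) {
            return 2;
        }
        return 4;
    }

    // Returns the replica host closest to the reader, or null if there are none.
    // Ties are broken in favour of the earlier entry in the list.
    static String pick(String reader, List<String> replicas, Map<String, String> rackOf) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String replica : replicas) {
            int d = distance(reader, replica, rackOf);
            if (d < bestDistance) {
                bestDistance = d;
                best = replica;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("host1", "/rack1");
        rackOf.put("host2", "/rack1");
        rackOf.put("host3", "/rack2");

        // Replicas of the block currently live on host3 and host2.
        List<String> replicas = Arrays.asList("host3", "host2");
        System.out.println("reader host2 picks " + pick("host2", replicas, rackOf));
        System.out.println("reader host1 picks " + pick("host1", replicas, rackOf));
    }
}
```

The point is that the choice is made at read time against the current replica locations, so a map task that happens to run on a node holding a copy of the block reads it locally whether or not the split listed that node.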
Arun