Yup, that's correct. A rack, then, can be defined however you like.
One possibility is that a rack is defined by hosts on the same subnet.
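To make the subnet idea concrete, here is a minimal Python sketch that derives a rack name from a host's IP address. The /24 prefix length, the "/rack-" naming scheme, and the function name rack_for_ip are all assumptions made for illustration; nothing in Hadoop prescribes them.

```python
import ipaddress

DEFAULT_PREFIX_LEN = 24  # assumption: one rack per /24 subnet


def rack_for_ip(ip, prefix_len=DEFAULT_PREFIX_LEN):
    """Return a hypothetical rack name derived from the host's subnet."""
    # strict=False allows a host address to be passed instead of a network address.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return f"/rack-{network.network_address}"


if __name__ == "__main__":
    # Hosts in the same /24 map to the same rack name.
    for host_ip in ("10.1.2.17", "10.1.2.200", "10.1.3.5"):
        print(host_ip, rack_for_ip(host_ip))
```

A script along these lines could print the rack name for the local host, but whether the subnet boundary really matches the physical rack layout depends on how the network is set up.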
Cheers,
Nige
On Feb 14, 2007, at 1:47 PM, Vasiliy Baranov wrote:
Hi Nigel,
It is so nice to hear from you again!
Thank you for clarifying these. I have a follow-up question: how
are racks configured, that is, how does the system know which rack
a machine is in? I went through the proposal
(https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf)
and the patch source code
(https://issues.apache.org/jira/secure/attachment/12350262/rack.patch),
and it looks like the DataNode receives its rack's "network
location" via the new -r/--rack option or, if that option is not
specified, by running the "dfs.network.script" script through
Runtime.exec. If neither is specified, the DataNode belongs to the
default rack. Correct?
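The lookup order described above can be sketched roughly as follows. This is an illustration in Python, not the code from the patch: the function name resolve_network_location, the "/default-rack" placeholder, and the assumption that the script prints the location as its first token on standard output are all hypothetical.

```python
import subprocess

DEFAULT_RACK = "/default-rack"  # assumption: placeholder name for the default rack


def resolve_network_location(rack_option=None, script_path=None):
    """Pick a network location: explicit option, then script, then default."""
    # 1. An explicitly supplied rack option wins.
    if rack_option:
        return rack_option
    # 2. Otherwise run the configured script, if any, and take the first
    #    whitespace-separated token it prints (an assumed output format).
    if script_path:
        result = subprocess.run(
            [script_path], capture_output=True, text=True, check=True
        )
        tokens = result.stdout.split()
        if tokens:
            return tokens[0]
    # 3. Neither was given (or the script printed nothing): default rack.
    return DEFAULT_RACK
```

For example, resolve_network_location("/rack1", "/some/script") returns "/rack1" without running the script at all.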
Thank you,
Vasiliy
Nigel Daley wrote:
Hi Vasiliy :)
I have a question regarding task allocation to TaskTrackers
(could not find an answer in the docs). When a MapReduce job is
run, does the system attempt to schedule a Map task on a machine
that contains a replica of the task's input data, or not?
Yes, the JobTracker attempts to schedule the map on a node
containing that map's input split.
If yes, how does the system know which TaskTracker corresponds to
which DataNode (by IP address, by host name, or by something else)?
See InputSplit.getLocations()
(http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/InputSplit.java?view=markup).
Currently, host names are used, but I believe it's moving to IP
addresses (see https://issues.apache.org/jira/browse/HADOOP-985).
Also, what happens if that fails?
The task is scheduled elsewhere. However, now that DataNodes are
aware of the rack they are on (as of 0.11.0), the JobTracker needs
to be modified so that its fallback is to attempt to place the
map on a node "close" to its data (on the same rack).
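The intended preference order (same host, then same rack, then anywhere) can be sketched as below. This is a simplified Python illustration, not the JobTracker's actual logic; the function pick_tracker and the rack_of mapping are hypothetical names introduced for the example.

```python
def pick_tracker(split_hosts, free_trackers, rack_of):
    """Choose a tracker: same host first, then same rack, then any free one.

    split_hosts   -- host names holding a replica of the map's input split
    free_trackers -- host names of trackers with a free map slot
    rack_of       -- dict mapping host name to rack name (assumed to be known)
    """
    if not free_trackers:
        return None
    split_host_set = set(split_hosts)
    # 1. Node-local: a tracker running on a host that holds the data.
    for tracker in free_trackers:
        if tracker in split_host_set:
            return tracker
    # 2. Rack-local: a tracker on the same rack as one of the replicas.
    split_racks = {rack_of[h] for h in split_hosts if h in rack_of}
    for tracker in free_trackers:
        if rack_of.get(tracker) in split_racks:
            return tracker
    # 3. Fallback: any free tracker.
    return free_trackers[0]
```

With rack_of = {"a": "/r1", "b": "/r1", "c": "/r2"} and the data on host "a", the free trackers ["c", "b"] yield "b", since it shares a rack with "a".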
Cheers,
Nige