Hi, I might be misunderstanding how scheduling is supposed to work, or I might have something misconfigured, but my Map/Reduce jobs don't seem to run where my data is located.
I see a bunch of messages like this:

INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201106062049_0001_m_000021 has split on node:/rack1/rack1node1.local

indicating that the scheduler has correctly found the source data on my node /rack1/rack1node1. This is the only copy of the data: for the purposes of this experiment I have set dfs.replication = dfs.replication.min = dfs.replication.max = 1, so there is only one replica.

However, the JOB_SETUP, MAP, REDUCE, and JOB_CLEANUP tasks then run on arbitrary tasktrackers, usually not where the data is located, so the first thing they have to do is pull the data over the network from another node.

Did I miss something - or, hopefully, configure something wrong? :)

Ian
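For reference, forcing a single replica as described above amounts to something like the following in hdfs-site.xml (a sketch of the relevant properties, not my exact file):

```xml
<!-- hdfs-site.xml: pin every file to exactly one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>   <!-- default replication for new files -->
  </property>
  <property>
    <name>dfs.replication.min</name>
    <value>1</value>   <!-- minimum replicas before a write succeeds -->
  </property>
  <property>
    <name>dfs.replication.max</name>
    <value>1</value>   <!-- cap, so nothing can request more replicas -->
  </property>
</configuration>
```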