As Aaron mentioned, the scheduler has very little leeway when you have a single replica.

OTOH, schedulers equate rack-locality to node-locality - this makes sense for a large-scale system since intra-rack b/w is good enough for most installs of Hadoop.

Arun

On Jul 12, 2011, at 7:36 AM, Virajith Jalaparti wrote:

> I am using a replication factor of 1 since I don't want to incur the overhead of
> replication and I am not much worried about reliability.
>
> I am just using the default Hadoop scheduler (FIFO, I think!). In case of a
> single rack, rack-locality doesn't really have any meaning. Obviously
> everything will run in the same rack. I am concerned about data-local maps. I
> assumed that Hadoop would do a much better job at ensuring data-local maps
> but it doesn't seem to be the case here.
>
> -Virajith
>
> On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy <a...@hortonworks.com> wrote:
> Why are you running with replication factor of 1?
>
> Also, it depends on the scheduler you are using. The CapacityScheduler in
> 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with
> FairScheduler.
>
> IAC, running on a single rack with replication of 1 implies rack-locality for
> all tasks which, in most cases, is good enough.
>
> Arun
>
> On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:
>
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
> > data using a 20-node cluster. HDFS is configured to use 128MB
> > block size (so 1600 maps are created) and a replication factor of 1 is being
> > used. All the 20 nodes are also HDFS datanodes. I was using a bandwidth
> > value of 50Mbps between each of the nodes (this was configured using Linux
> > "tc"). I see that around 90% of the map tasks are reading data over the
> > network, i.e. most of the map tasks are not being scheduled at the nodes
> > where the data to be processed by them is located.
> > My understanding was that Hadoop tries to schedule as many data-local maps
> > as possible. But in this situation, this does not seem to happen.
> > Any reason why this is happening? And is there a way to actually configure
> > Hadoop to ensure the maximum possible node locality?
> > Any help regarding this is very much appreciated.
> >
> > Thanks,
> > Virajith
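
For reference, the numbers quoted in this thread can be checked with a rough back-of-the-envelope sketch. It only does the arithmetic and does not model any Hadoop scheduler. Assumptions that are not stated in the thread: "GB" and "MB" are binary units, "50Mbps" means 50 * 10**6 bits per second, blocks are spread evenly over the datanodes, and replicas of a block sit on distinct nodes. The helper name local_node_fraction is made up for illustration.

```python
# Rough arithmetic for the setup described in the thread.
# Assumed, not stated in the thread: binary units for GB/MB,
# 50 Mbps = 50 * 10**6 bits/s, blocks spread evenly over the nodes.

INPUT_BYTES = 200 * 1024**3        # 200 GB of Sort input
BLOCK_BYTES = 128 * 1024**2        # 128 MB HDFS block size
NUM_NODES = 20                     # every node is also a datanode
LINK_BITS_PER_SEC = 50 * 10**6     # per-node limit set with "tc"

# One map task per block, which matches the 1600 maps in the thread.
num_maps = INPUT_BYTES // BLOCK_BYTES

# Average number of blocks stored on each node with replication factor 1.
blocks_per_node = num_maps / NUM_NODES

# Time to pull one full block across a 50 Mbps link (ignores overheads).
remote_read_secs = BLOCK_BYTES * 8 / LINK_BITS_PER_SEC


def local_node_fraction(replication, num_nodes=NUM_NODES):
    """Fraction of nodes on which a given map could run data-local,
    assuming each replica of its block lives on a distinct node."""
    return min(replication, num_nodes) / num_nodes


if __name__ == "__main__":
    print("map tasks:", num_maps)
    print("blocks per node (average):", blocks_per_node)
    print("seconds per remote block read: %.1f" % remote_read_secs)
    for r in (1, 3):
        print("replication %d: %.0f%% of nodes are data-local for a block"
              % (r, 100 * local_node_fraction(r)))
```

With a single replica there is exactly one node where each map can run data-local, which is the "very little leeway" mentioned above; a higher replication factor widens that set proportionally under these assumptions.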