Re: Lack of data locality in Hadoop-0.20.2

Arun C Murthy Tue, 12 Jul 2011 10:15:48 -0700

As Aaron mentioned, the scheduler has very little leeway when you have a single 
replica.


OTOH, schedulers equate rack-locality to node-locality - this makes sense 
for a large-scale system since intra-rack b/w is good enough for most installs 
of Hadoop.

Arun
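
To put numbers on the "very little leeway" point, here is a rough back-of-the-envelope sketch in Python using the figures from the original question (200GB of input, 128MB blocks, 20 nodes). It assumes one map task per block, blocks spread evenly across datanodes, and 1GB = 1024MB; the function name `locality_summary` is made up for illustration and is not part of Hadoop.

```python
import math

def locality_summary(input_gb, block_mb, nodes, replication):
    """Rough locality arithmetic (assumes one map per block, even block spread)."""
    # One map task per HDFS block of input.
    num_maps = math.ceil(input_gb * 1024 / block_mb)
    return {
        "map_tasks": num_maps,
        # Block replicas stored on each datanode, on average.
        "avg_replicas_per_node": num_maps * replication / nodes,
        # Number of hosts on which a given map can run node-local.
        "node_local_hosts_per_map": replication,
        # Share of the cluster that is node-local for a given map.
        "fraction_of_cluster": replication / nodes,
    }

single = locality_summary(200, 128, 20, replication=1)
triple = locality_summary(200, 128, 20, replication=3)

for name, summary in (("replication=1", single), ("replication=3", triple)):
    print(name, summary)
```

With a single replica, each map has exactly one host where it can be node-local, so whenever that host's slots are busy the scheduler can only wait or run the task remotely. With three replicas it would have three candidate hosts per map. This says nothing about how a particular scheduler actually behaves; it only counts the options available to it.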

On Jul 12, 2011, at 7:36 AM, Virajith Jalaparti wrote:

> I am using a replication factor of 1 since I don't want to incur the overhead 
> of replication and I am not much worried about reliability.
> 
> I am just using the default Hadoop scheduler (FIFO, I think!). In case of a 
> single rack, rack-locality doesn't really have any meaning. Obviously 
> everything will run in the same rack. I am concerned about data-local maps. I 
> assumed that Hadoop would do a much better job at ensuring data-local maps 
> but it doesn't seem to be the case here.
> 
> -Virajith
> 
> On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy <a...@hortonworks.com> wrote:
> Why are you running with replication factor of 1?
> 
> Also, it depends on the scheduler you are using. The CapacityScheduler in 
> 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with 
> FairScheduler.
> 
> IAC, running on a single rack with replication of 1 implies rack-locality for 
> all tasks which, in most cases, is good enough.
> 
> Arun
> 
> On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:
> 
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input 
> > data on a 20-node cluster. HDFS is configured to use a 128MB block size 
> > (so 1600 maps are created) and a replication factor of 1. All 20 nodes 
> > are also HDFS datanodes. I was using a bandwidth value of 50Mbps between 
> > the nodes (this was configured using Linux "tc"). I see that around 90% 
> > of the map tasks are reading data over the network, i.e. most of the map 
> > tasks are not being scheduled on the nodes where their input data is 
> > located.
> > My understanding was that Hadoop tries to schedule as many data-local maps 
> > as possible, but that does not seem to happen here. Is there a reason for 
> > this, and is there a way to configure Hadoop to ensure the maximum 
> > possible node locality?
> > Any help regarding this is very much appreciated.
> >
> > Thanks,
> > Virajith
> 
>
