How do I find the number of data-local map tasks that are launched? I checked the log files but didn't see any information about this. All the map tasks are rack-local, since I am running the job on just a single rack. From the completion time per map (comparing it to the case where I have 1 Gbps of bandwidth between the nodes, i.e. the case where network bandwidth is not a bottleneck), I saw that more than 90% of the maps are actually reading data over the network.
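[Editor's sketch: one way to tally these numbers is to grep the JobTracker log. The log excerpt and task IDs below are made up, and the "Choosing data-local task" / "Choosing rack-local task" message text is my assumption about Hadoop 0.20.x JobInProgress logging, so verify the exact wording against your own log.]

```python
# Count map-task locality decisions from (a sample of) a JobTracker log.
# The log lines below are a fabricated example, not from the real cluster.
sample_log = """\
2011-07-12 13:00:01 INFO mapred.JobInProgress: Choosing data-local task task_201107121258_0001_m_000001
2011-07-12 13:00:02 INFO mapred.JobInProgress: Choosing rack-local task task_201107121258_0001_m_000002
2011-07-12 13:00:03 INFO mapred.JobInProgress: Choosing data-local task task_201107121258_0001_m_000003
"""

def count_locality(log_text):
    """Return (data_local, rack_local) counts of scheduling log lines."""
    lines = log_text.splitlines()
    data_local = sum('Choosing data-local task' in line for line in lines)
    rack_local = sum('Choosing rack-local task' in line for line in lines)
    return data_local, rack_local

print(count_locality(sample_log))  # (2, 1)
```

The same totals should also show up as the "Data-local map tasks" and "Rack-local map tasks" job counters on the JobTracker web UI for a completed job, which avoids log-scraping entirely.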
I understand that some maps might be launched as non-data-local tasks, but I am surprised that around 90% of the maps are running as non-data-local tasks. I have not measured how much bandwidth was being used, but I think the whole 50 Mbps is being used.

Thanks,
Virajith

On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <ha...@cloudera.com> wrote:
> How much of the bandwidth did you see being utilized? What was the
> count of tasks launched as data-local map tasks versus rack-local
> ones?
>
> A little bit of edge-record data is always read over the network, but
> that is highly insignificant compared to the amount of data read
> locally (a whole block, if available).
>
> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
> <virajit...@gmail.com> wrote:
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
> > input data using a cluster of 20 nodes. HDFS is configured with a
> > 128MB block size (so 1600 maps are created) and a replication factor
> > of 1. All 20 nodes are also HDFS datanodes. I set the bandwidth
> > between each pair of nodes to 50Mbps (configured using Linux "tc").
> > I see that around 90% of the map tasks are reading data over the
> > network, i.e. most of the map tasks are not being scheduled on the
> > nodes where the data they process is located.
> > My understanding was that Hadoop tries to schedule as many data-local
> > maps as possible, but in this situation that does not seem to happen.
> > Any reason why this is happening? And is there a way to configure
> > Hadoop to ensure the maximum possible node locality?
> > Any help regarding this is very much appreciated.
> >
> > Thanks,
> > Virajith
>
> --
> Harsh J
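[Editor's note: one factor worth quantifying is the replication factor of 1 mentioned in the thread. Under the simplifying assumption that each block's replicas sit on distinct, uniformly chosen nodes (a simplification of real HDFS placement), the chance that any one tasktracker holds a given block locally is just replication/nodes, so a single replica leaves the scheduler very few local choices per node. A minimal sketch of that probability:]

```python
from fractions import Fraction

def local_replica_probability(nodes, replication):
    """Probability that one particular node holds a replica of a given
    block, assuming replicas land on distinct, uniformly chosen nodes
    (a simplification of real HDFS placement policy)."""
    return Fraction(replication, nodes)

# With the 20-node cluster from the thread:
print(local_replica_probability(20, 1))  # 1/20
print(local_replica_probability(20, 3))  # 3/20
```

This alone does not explain 90% non-local maps (the scheduler tries local tasks first across all pending maps), but it shows how replication factor 1 narrows the scheduler's options compared with the default factor of 3.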