How much bandwidth did you see being utilized? What was the count of
tasks launched as data-local map tasks versus rack-local ones?
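
If you have the job ID handy, here is a rough, untested sketch of pulling
those two counters out with the old mapred API; the counter group string
and counter names are what I recall 0.20.x using, so treat them as
assumptions and double-check against your JobTracker UI:

  // Sketch only: read locality counters for a job via the old mapred API.
  import org.apache.hadoop.mapred.Counters;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.JobID;
  import org.apache.hadoop.mapred.RunningJob;

  public class LocalityCounters {
    public static void main(String[] args) throws Exception {
      JobClient client = new JobClient(new JobConf());
      // args[0] is a placeholder job ID, e.g. "job_201107121530_0001"
      RunningJob job = client.getJob(JobID.forName(args[0]));
      Counters counters = job.getCounters();
      // Counter group the JobTracker uses for locality in 0.20.x (assumption)
      String group = "org.apache.hadoop.mapred.JobInProgress$Counter";
      long dataLocal = counters.findCounter(group, "DATA_LOCAL_MAPS").getValue();
      long rackLocal = counters.findCounter(group, "RACK_LOCAL_MAPS").getValue();
      System.out.println("Data-local map tasks: " + dataLocal);
      System.out.println("Rack-local map tasks: " + rackLocal);
    }
  }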

A little bit of edge-record data is always read over the network, but
that is negligible compared to the amount of data read locally (a whole
block's worth, if available).
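
To put a rough number on it: with the 128MB blocks from your setup, and
assuming a single record never exceeds, say, 64KB (purely an illustrative
bound, not something from this thread), the over-read per map is about a
twentieth of one percent of a block:

  // Back-of-the-envelope check; the 64KB record bound is an assumption.
  public class EdgeReadOverhead {
    public static void main(String[] args) {
      long blockSize = 128L * 1024 * 1024; // 128MB block, as in the question below
      long maxEdgeRead = 64L * 1024;       // assumed worst-case size of one record
      System.out.printf("worst-case remote read per data-local map: %.4f%%%n",
          100.0 * maxEdgeRead / blockSize); // prints ~0.0488%
    }
  }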

On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
<virajit...@gmail.com> wrote:
> Hi,
>
> I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
> data on a 20-node cluster. HDFS is configured with a 128MB block size (so
> 1600 maps are created) and a replication factor of 1. All 20 nodes are
> also HDFS datanodes. The bandwidth between each pair of nodes was limited
> to 50Mbps (configured using Linux "tc"). I see that around 90% of the map
> tasks are reading data over the network, i.e. most of the map tasks are
> not being scheduled on the nodes where the data they process is located.
> My understanding was that Hadoop tries to schedule as many data-local
> maps as possible, but in this situation that does not seem to happen. Why
> is this happening, and is there a way to configure Hadoop to ensure the
> maximum possible node locality?
> Any help regarding this is very much appreciated.
>
> Thanks,
> Virajith
>



-- 
Harsh J
