Virajith,

You can see the number of data-local vs. non-data-local map tasks in the counters of the job itself.
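In Hadoop 0.20 these show up under "Job Counters" (e.g. on the JobTracker web UI, or in the client's console output when the job finishes) as "Data-local map tasks" and "Rack-local map tasks". As a minimal sketch, the counts can be pulled out of that text with a little parsing; the sample output below is illustrative, not from a real run:

```python
import re

# Illustrative excerpt of a job's counter output (the counter names match
# the locality counters reported by Hadoop 0.20; the values are made up).
sample_status = """\
Job Counters
    Launched map tasks=1600
    Data-local map tasks=150
    Rack-local map tasks=1450
"""

def locality_counters(status_text):
    """Return a dict of the locality-related job counters found in the text."""
    counters = {}
    for name in ("Launched map tasks", "Data-local map tasks",
                 "Rack-local map tasks"):
        m = re.search(re.escape(name) + r"=(\d+)", status_text)
        if m:
            counters[name] = int(m.group(1))
    return counters

counters = locality_counters(sample_status)
non_local = counters["Launched map tasks"] - counters["Data-local map tasks"]
print(counters)
print("maps reading over the network:", non_local)
```

Comparing "Data-local map tasks" against the total launched maps tells you directly how many tasks read their input over the network, rather than inferring it from per-map completion times.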
On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti <virajit...@gmail.com> wrote:
> How do I find the number of data-local map tasks that are launched? I
> checked the log files but didn't see any information about this. All the
> map tasks are rack-local since I am running the job using just a single
> rack. From the completion time per map (comparing it to the case where I
> have 1Gbps of bandwidth between the nodes, i.e. the case where network
> bandwidth is not a bottleneck), I saw that more than 90% of the maps are
> actually reading data over the network.
>
> I understand that there might be some maps that are launched as
> non-data-local tasks, but I am surprised that around 90% of the maps are
> running as non-data-local tasks.
>
> I have not measured how much bandwidth was being used, but I think the
> whole 50Mbps is being used.
>
> Thanks,
> Virajith
>
>
> On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> How much bandwidth did you see being utilized? What was the count of
>> tasks launched as data-local map tasks versus rack-local ones?
>>
>> A little bit of edge-record data is always read over the network, but
>> that is highly insignificant compared to the amount of data read
>> locally (a whole block, if available).
>>
>> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
>> <virajit...@gmail.com> wrote:
>> > Hi,
>> >
>> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
>> > input data using a 20-node cluster. HDFS is configured to use a 128MB
>> > block size (so 1600 maps are created) and a replication factor of 1.
>> > All 20 nodes are also HDFS datanodes. I was using a bandwidth of
>> > 50Mbps between each of the nodes (configured using Linux "tc"). I see
>> > that around 90% of the map tasks are reading data over the network,
>> > i.e. most of the map tasks are not being scheduled on the nodes where
>> > the data to be processed by them is located.
>> > My understanding was that Hadoop tries to schedule as many data-local
>> > maps as possible, but in this situation that does not seem to happen.
>> > Any reason why this is happening? And is there a way to configure
>> > Hadoop to ensure the maximum possible node locality?
>> > Any help regarding this is very much appreciated.
>> >
>> > Thanks,
>> > Virajith
>> >
>>
>>
>>
>> --
>> Harsh J
>
>

--
Harsh J