How do I find the number of data-local map tasks that are launched? I checked the log files but didn't see any information about this. All the map tasks are rack-local, since I am running the job on just a single rack. From the completion time per map (comparing it to the case where I have 1 Gbps of bandwidth between the nodes, i.e. the case where network bandwidth is not a bottleneck), I saw that more than 90% of the maps are actually reading data over the network.
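[Editor's sketch: one way to tally these numbers is to grep the JobTracker log. The log excerpt and task IDs below are made up, and the "Choosing data-local task" / "Choosing rack-local task" message text is my assumption about Hadoop 0.20.x JobInProgress logging, so verify the exact wording against your own log.]

```python
# Count map-task locality decisions from (a sample of) a JobTracker log.
# The log lines below are a fabricated example, not from the real cluster.
sample_log = """\
2011-07-12 13:00:01 INFO mapred.JobInProgress: Choosing data-local task task_201107121258_0001_m_000001
2011-07-12 13:00:02 INFO mapred.JobInProgress: Choosing rack-local task task_201107121258_0001_m_000002
2011-07-12 13:00:03 INFO mapred.JobInProgress: Choosing data-local task task_201107121258_0001_m_000003
"""

def count_locality(log_text):
    """Return (data_local, rack_local) counts of scheduling log lines."""
    lines = log_text.splitlines()
    data_local = sum('Choosing data-local task' in line for line in lines)
    rack_local = sum('Choosing rack-local task' in line for line in lines)
    return data_local, rack_local

print(count_locality(sample_log))  # (2, 1)
```

The same totals should also show up as the "Data-local map tasks" and "Rack-local map tasks" job counters on the JobTracker web UI for a completed job, which avoids log-scraping entirely.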
I understand that some maps might be launched as non-data-local tasks, but I am surprised that around 90% of the maps are running as non-data-local tasks. I have not measured how much bandwidth was being used, but I think the whole 50 Mbps is being used.

Thanks,
Virajith

On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <ha...@cloudera.com> wrote:
> How much of the bandwidth did you see being utilized? What was the
> count of tasks launched as data-local map tasks versus rack-local
> ones?
>
> A little bit of edge-record data is always read over the network, but
> that is highly insignificant compared to the amount of data read
> locally (a whole block, if available).
>
> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
> <virajit...@gmail.com> wrote:
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
> > input data using a cluster of 20 nodes. HDFS is configured with a
> > 128MB block size (so 1600 maps are created) and a replication factor
> > of 1. All 20 nodes are also HDFS datanodes. I set the bandwidth
> > between each pair of nodes to 50Mbps (configured using Linux "tc").
> > I see that around 90% of the map tasks are reading data over the
> > network, i.e. most of the map tasks are not being scheduled on the
> > nodes where the data they process is located.
> > My understanding was that Hadoop tries to schedule as many data-local
> > maps as possible, but in this situation that does not seem to happen.
> > Any reason why this is happening? And is there a way to configure
> > Hadoop to ensure the maximum possible node locality?
> > Any help regarding this is very much appreciated.
> >
> > Thanks,
> > Virajith
>
> --
> Harsh J
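[Editor's note: one factor worth quantifying is the replication factor of 1 mentioned in the thread. Under the simplifying assumption that each block's replicas sit on distinct, uniformly chosen nodes (a simplification of real HDFS placement), the chance that any one tasktracker holds a given block locally is just replication/nodes, so a single replica leaves the scheduler very few local choices per node. A minimal sketch of that probability:]

```python
from fractions import Fraction

def local_replica_probability(nodes, replication):
    """Probability that one particular node holds a replica of a given
    block, assuming replicas land on distinct, uniformly chosen nodes
    (a simplification of real HDFS placement policy)."""
    return Fraction(replication, nodes)

# With the 20-node cluster from the thread:
print(local_replica_probability(20, 1))  # 1/20
print(local_replica_probability(20, 3))  # 3/20
```

This alone does not explain 90% non-local maps (the scheduler tries local tasks first across all pending maps), but it shows how replication factor 1 narrows the scheduler's options compared with the default factor of 3.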