Virajith,

You can see the number of data-local vs. non-data-local map tasks in the counters of the job itself.
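In Hadoop 0.20 these show up under "Job Counters" (e.g. on the JobTracker web UI, or in the client's console output when the job finishes) as "Data-local map tasks" and "Rack-local map tasks". As a minimal sketch, the counts can be pulled out of that text with a little parsing; the sample output below is illustrative, not from a real run:

```python
import re

# Illustrative excerpt of a job's counter output (the counter names match
# the locality counters reported by Hadoop 0.20; the values are made up).
sample_status = """\
Job Counters
    Launched map tasks=1600
    Data-local map tasks=150
    Rack-local map tasks=1450
"""

def locality_counters(status_text):
    """Return a dict of the locality-related job counters found in the text."""
    counters = {}
    for name in ("Launched map tasks", "Data-local map tasks",
                 "Rack-local map tasks"):
        m = re.search(re.escape(name) + r"=(\d+)", status_text)
        if m:
            counters[name] = int(m.group(1))
    return counters

counters = locality_counters(sample_status)
non_local = counters["Launched map tasks"] - counters["Data-local map tasks"]
print(counters)
print("maps reading over the network:", non_local)
```

Comparing "Data-local map tasks" against the total launched maps tells you directly how many tasks read their input over the network, rather than inferring it from per-map completion times.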
On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti <virajit...@gmail.com> wrote:
> How do I find the number of data-local map tasks that are launched? I
> checked the log files but didn't see any information about this. All the
> map tasks are rack-local since I am running the job using just a single
> rack. From the completion time per map (comparing it to the case where I
> have 1Gbps of bandwidth between the nodes, i.e. the case where network
> bandwidth is not a bottleneck), I saw that more than 90% of the maps are
> actually reading data over the network.
>
> I understand that there might be some maps that are launched as
> non-data-local tasks, but I am surprised that around 90% of the maps are
> running as non-data-local tasks.
>
> I have not measured how much bandwidth was being used, but I think the
> whole 50Mbps is being used.
>
> Thanks,
> Virajith
>
>
> On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> How much bandwidth did you see being utilized? What was the count of
>> tasks launched as data-local map tasks versus rack-local ones?
>>
>> A little bit of edge-record data is always read over the network, but
>> that is highly insignificant compared to the amount of data read
>> locally (a whole block, if available).
>>
>> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
>> <virajit...@gmail.com> wrote:
>> > Hi,
>> >
>> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
>> > input data using a 20-node cluster. HDFS is configured to use a 128MB
>> > block size (so 1600 maps are created) and a replication factor of 1.
>> > All 20 nodes are also HDFS datanodes. I was using a bandwidth of
>> > 50Mbps between each of the nodes (configured using Linux "tc"). I see
>> > that around 90% of the map tasks are reading data over the network,
>> > i.e. most of the map tasks are not being scheduled on the nodes where
>> > the data to be processed by them is located.
>> > My understanding was that Hadoop tries to schedule as many data-local
>> > maps as possible, but in this situation that does not seem to happen.
>> > Any reason why this is happening? And is there a way to configure
>> > Hadoop to ensure the maximum possible node locality?
>> > Any help regarding this is very much appreciated.
>> >
>> > Thanks,
>> > Virajith
>> >
>>
>>
>>
>> --
>> Harsh J
>
>

--
Harsh J