Harsh,

I am assuming you mean the web interface of the JobTracker, right? What I see there is appended at the end of this email. Is there supposed to be a counter that gives the number of data-local map tasks? One obvious way to find this would be to look at the location of the input split of each mapper and check whether it matches the node on which the map task ran.
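A minimal sketch of that comparison, in Python, might look like the following. The two input dictionaries are hypothetical: it assumes the hosts holding each split (for example, what `InputSplit.getLocations()` reports) and the host each map attempt ran on have already been collected by some other means, and that both use the same hostname form.

```python
def classify_locality(split_hosts, task_host):
    """Label one map task by comparing its host with the hosts storing its split."""
    return "data-local" if task_host in set(split_hosts) else "non-local"


def count_locality(split_locations, task_hosts):
    """Count data-local vs. non-local map tasks.

    split_locations: task id -> list of hosts holding the task's input split
                     (hypothetical input, gathered separately)
    task_hosts:      task id -> host on which the map task actually ran
                     (hypothetical input, gathered separately)
    """
    counts = {"data-local": 0, "non-local": 0}
    for task_id, host in task_hosts.items():
        hosts = split_locations.get(task_id, [])
        counts[classify_locality(hosts, host)] += 1
    return counts


# Made-up example: with a replication factor of 1, each split has one host.
example_splits = {"m_000000": ["node01"], "m_000001": ["node02"], "m_000002": ["node03"]}
example_tasks = {"m_000000": "node01", "m_000001": "node07", "m_000002": "node09"}
example_counts = count_locality(example_splits, example_tasks)
```

With a replication factor of 1 there is only one candidate host per split, so a task is data-local only if it lands on exactly that node.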
Do I need to enable some config parameter to actually see the counter that shows the number of data-local tasks?

Thanks,
Virajith

==================================================================================

Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      1600        0         0         1600       0        3 / 46
reduce   100.00%      20          0         0         20         0        0 / 1

Counter                                 Map             Reduce              Total

Job Counters
  Launched reduce tasks                   0                  0                 21
  Rack-local map tasks                    0                  0              1,649
  Launched map tasks                      0                  0              1,649

FileSystemCounters
  FILE_BYTES_READ           215,256,891,609    494,340,016,724    709,596,908,333
  HDFS_BYTES_READ           215,481,828,554                  0    215,481,828,554
  FILE_BYTES_WRITTEN        430,057,823,630    494,340,016,724    924,397,840,354
  HDFS_BYTES_WRITTEN                      0    215,457,161,571    215,457,161,571

Map-Reduce Framework
  Reduce input groups                     0         20,369,713         20,369,713
  Combine output records                  0                  0                  0
  Map input records              20,443,005                  0         20,443,005
  Reduce shuffle bytes                    0    214,894,166,095    214,894,166,095
  Reduce output records                   0         20,443,005         20,443,005
  Spilled Records                40,886,010         46,997,605         87,883,615
  Map output bytes          214,913,316,171                  0    214,913,316,171
  Map input bytes           215,457,082,591                  0    215,457,082,591
  Map output records             20,443,005                  0         20,443,005
  Combine input records                   0                  0                  0
  Reduce input records                    0         20,443,005         20,443,005

On Tue, Jul 12, 2011 at 2:43 PM, Harsh J <ha...@cloudera.com> wrote:
> Virajith,
>
> You can see the counters for data-local vs. non-data-local tasks in
> the job itself.
>
> On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti
> <virajit...@gmail.com> wrote:
> > How do I find the number of data-local map tasks that are launched?
> > I checked the log files but didn't see any information about this.
> > All the map tasks are rack-local since I am running the job using
> > just a single rack.
> > From the completion time per map (comparing it to the case where I
> > have 1Gbps of bandwidth between the nodes, i.e. the case where
> > network bandwidth is not a bottleneck), I saw that more than 90% of
> > the maps are actually reading data over the network.
> >
> > I understand that there might be some maps that are actually
> > launched as non-data-local tasks, but I am surprised that around 90%
> > of the maps are actually running as non-data-local tasks.
> >
> > I have not measured how much bandwidth was being used, but I think
> > the whole 50Mbps is being used.
> >
> > Thanks,
> > Virajith
> >
> > On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> How much bandwidth did you see being utilized? What was the count
> >> of tasks launched as data-local map tasks versus rack-local ones?
> >>
> >> A little bit of edge record data is always read over the network,
> >> but that is highly insignificant compared to the amount of data
> >> read locally (a whole block size, if available).
> >>
> >> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
> >> <virajit...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB
> >> > of input data using a 20-node cluster. HDFS is configured to use
> >> > a 128MB block size (so 1600 maps are created) and a replication
> >> > factor of 1 is being used. All 20 nodes are also HDFS datanodes.
> >> > I was using a bandwidth value of 50Mbps between each of the nodes
> >> > (this was configured using Linux "tc"). I see that around 90% of
> >> > the map tasks are reading data over the network, i.e. most of the
> >> > map tasks are not being scheduled at the nodes where the data to
> >> > be processed by them is located.
> >> > My understanding was that Hadoop tries to schedule as many
> >> > data-local maps as possible. But in this situation, this does not
> >> > seem to happen. Any reason why this is happening? And is there a
> >> > way to actually configure Hadoop to ensure the maximum possible
> >> > node locality?
> >> > Any help regarding this is very much appreciated.
> >> >
> >> > Thanks,
> >> > Virajith
> >>
> >> --
> >> Harsh J
>
> --
> Harsh J
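
The figures quoted in the original message can be sanity-checked with a little arithmetic. The sketch below, in Python, assumes binary units (1 GB = 1024 MB), one map per HDFS block, and a link that is fully available to a single block transfer; none of these is stated in the thread, and the helper names are made up for illustration.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def num_map_tasks(input_bytes, block_bytes):
    """Number of maps, assuming one map per block (ceiling division)."""
    return -(-input_bytes // block_bytes)


def block_transfer_seconds(block_bytes, link_bits_per_second):
    """Idealized time to move one block over a link, ignoring protocol overhead."""
    return block_bytes * 8 / link_bits_per_second


maps = num_map_tasks(200 * GB, 128 * MB)                  # 200GB input, 128MB blocks
slow_link = block_transfer_seconds(128 * MB, 50 * 10**6)  # 50Mbps link
fast_link = block_transfer_seconds(128 * MB, 10**9)       # 1Gbps link

print(f"maps: {maps}")
print(f"128MB block at 50Mbps: {slow_link:.1f} s")
print(f"128MB block at 1Gbps:  {fast_link:.1f} s")
```

Under these assumptions a remote block read costs roughly 21 seconds at 50Mbps against about 1 second at 1Gbps, which is why per-map completion time is a usable signal for spotting non-local reads.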