[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920098#action_12920098
 ] 

luoli commented on MAPREDUCE-2116:
----------------------------------

!getTaskToKill.JPG!

This is a real issue; we have run into the same problem. The attached picture 
is a profile of the heartbeat call. In one of our busy clusters, the 
getTasksToKill call consumes about 27/32 of the CPU time of a heartbeat in the 
JobTracker, and TreeMap.get accounts for most of that time. One of the TreeMaps 
is taskidToTIPMap in the JobTracker, which holds every TaskAttemptID -> TIP 
mapping and grows huge when the cluster runs many jobs and each job contains 
hundreds of thousands of tasks. We also found that in shouldClose of TIP, the 
get on the taskStatuses TreeMap is very slow as well. I don't know why, because 
that TreeMap contains at most two entries.
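
To make the lookup cost concrete, here is a minimal sketch that counts how many 
compareTo calls a single get() triggers. The Key class and its two int fields 
are hypothetical stand-ins, not Hadoop's TaskAttemptID, whose real comparison 
walks several nested ID fields and is likely more expensive per call. The 
sketch only illustrates that TreeMap.get pays one key comparison per tree level 
while HashMap.get relies on hashCode and equals; it does not by itself explain 
the taskStatuses profile.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical key type used only to count comparisons; NOT Hadoop's TaskAttemptID.
final class LookupCostSketch {

    // Number of compareTo calls since the last reset (single-threaded use only).
    static long compareCalls = 0;

    static final class Key implements Comparable<Key> {
        final int job;
        final int task;

        Key(int job, int task) { this.job = job; this.task = task; }

        @Override public int compareTo(Key o) {
            compareCalls++;
            int c = Integer.compare(job, o.job);
            return c != 0 ? c : Integer.compare(task, o.task);
        }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).job == job && ((Key) o).task == task;
        }

        // Unique per key as long as task < 100_000 and job stays small.
        @Override public int hashCode() { return job * 100_000 + task; }
    }

    /** Fills the map with jobs * tasksPerJob keys and returns it. */
    static Map<Key, String> fill(Map<Key, String> map, int jobs, int tasksPerJob) {
        for (int j = 0; j < jobs; j++) {
            for (int t = 0; t < tasksPerJob; t++) {
                map.put(new Key(j, t), "tip-" + j + "-" + t);
            }
        }
        return map;
    }

    /** Returns the number of compareTo calls made by one get() on the map. */
    static long comparesForOneGet(Map<Key, String> map, Key probe) {
        compareCalls = 0;
        map.get(probe);
        return compareCalls;
    }

    public static void main(String[] args) {
        Key probe = new Key(7, 1234);
        Map<Key, String> tree = fill(new TreeMap<Key, String>(), 10, 10_000);
        Map<Key, String> hash = fill(new HashMap<Key, String>(), 10, 10_000);
        Map<Key, String> tiny = fill(new TreeMap<Key, String>(), 1, 2);

        System.out.println("TreeMap, 100000 entries: " + comparesForOneGet(tree, probe));
        System.out.println("HashMap, 100000 entries: " + comparesForOneGet(hash, probe));
        System.out.println("TreeMap, 2 entries:      " + comparesForOneGet(tiny, new Key(0, 1)));
    }
}
```

Even a two-entry TreeMap still runs at least one full key comparison per get, 
so the per-call cost of compareTo and the number of calls per heartbeat both 
matter.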

As for the cost of taskidToTIPMap: what if the TaskAttemptID class kept a 
reference to its TIP? If so, the call to taskidToTIPMap.get would become 
unnecessary, since we could get a TaskAttemptID's corresponding TIP from the 
object itself, e.g. via TaskAttemptID.getTaskInProgress().
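
A minimal sketch of that idea, using hypothetical stand-in classes (Tip and 
AttemptId are assumptions for illustration, not Hadoop's TaskInProgress and 
TaskAttemptID, and this is not the promised patch). The TIP hands out attempt 
IDs that carry a reference back to it, so finding the TIP for an attempt is a 
field read instead of a map lookup.

```java
// Hypothetical stand-ins for TaskInProgress and TaskAttemptID.
// These are NOT Hadoop's real classes; names and fields are assumptions.
final class AttemptBackRefSketch {

    /** Stand-in for TaskInProgress (TIP). */
    static final class Tip {
        private final String taskId;
        private int nextAttempt = 0;

        Tip(String taskId) { this.taskId = taskId; }

        String getTaskId() { return taskId; }

        /** Creates a new attempt ID that remembers the TIP it belongs to. */
        AttemptId newAttempt() {
            return new AttemptId(taskId, nextAttempt++, this);
        }
    }

    /** Stand-in for TaskAttemptID with the proposed back-reference. */
    static final class AttemptId {
        private final String taskId;
        private final int attempt;
        // Back-reference to the owning TIP: set once at creation and
        // meant to stay local to the process that created the ID.
        private final Tip tip;

        AttemptId(String taskId, int attempt, Tip tip) {
            this.taskId = taskId;
            this.attempt = attempt;
            this.tip = tip;
        }

        /** Replaces a map lookup keyed by attempt ID with a field read. */
        Tip getTaskInProgress() { return tip; }

        @Override public String toString() {
            return "attempt_" + taskId + "_" + attempt;
        }
    }

    public static void main(String[] args) {
        Tip tip = new Tip("task_0001_m_000042");
        AttemptId first = tip.newAttempt();
        AttemptId second = tip.newAttempt();
        System.out.println(first + " -> " + first.getTaskInProgress().getTaskId());
        System.out.println(second + " -> " + second.getTaskInProgress().getTaskId());
    }
}
```

One caveat with this approach: the reference would only exist on ID objects 
created inside the JobTracker. An ID that is deserialized from an RPC message 
would arrive without it, so such IDs would still need some other way to 
resolve their TIP.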

I will upload a patch based on trunk soon.
Any thoughts?

> optimize getTasksToKill to reduce JobTracker contention
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-2116
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2116
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>            Reporter: Joydeep Sen Sarma
>         Attachments: getTaskToKill.JPG
>
>
> getTasksToKill shows up as one of the top routines holding the JT lock. 
> Specifically, the translation from attemptid to tip is very expensive:
>         at java.util.TreeMap.getEntry(TreeMap.java:328)
>         at java.util.TreeMap.get(TreeMap.java:255)
>         at org.apache.hadoop.mapred.TaskInProgress.shouldClose(TaskInProgress.java:500)
>         at org.apache.hadoop.mapred.JobTracker.getTasksToKill(JobTracker.java:3464)
>           locked <0x00002aab6ebb6640> (a org.apache.hadoop.mapred.JobTracker)
>         at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3181)
> This seems like an avoidable expense, since the tip for a given attempt is 
> fixed (and one should not need a map lookup to find the association). On a 
> different note, it is not clear to me why TreeMaps are in use here (I didn't 
> find any iteration over these maps). Any background info on why things are 
> arranged the way they are would be useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
