[
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566120#action_12566120
]
Vivek Ratan commented on HADOOP-2119:
-------------------------------------
Regarding finding the next task to run:
The problem is that we scan the array of TaskInProgress(TIP) objects (the
'tasks' array) from the beginning, each time we need to find a task to run and
we haven't found one in the cache. When you have lots and lots of tasks, after
some time the tasks array will predominantly be filled with running or
completed tasks, especially at lower indexes. So it takes longer and longer to
scan the array for virgin tasks (tasks that haven't run yet) or failed tasks
if we always start from the beginning. There are a number of ways to get around
this:
1. On a cache miss, if you don't mind returning a virgin task ahead of a failed
task, even though the failed task occurs earlier in the array than the virgin
task, then you can do the following. Keep an index (a pointer) into the tasks
array, which points to the first task in the array that is a virgin task. Every
TIP to the left of this index will have been run at least once. Every time we
need to scan the array, we look for runnable tasks from this index onwards.
This index will move across the length of the array just once during the
running of all tasks, so it averages out to O(1) for finding the first virgin
task. This is fairly easy to code, and should provide significant savings in
the cases where there are lots and lots of tasks. If a virgin task is not
found, you can use the same code to look for a failed task or a speculative
task.
2. If you want to keep the same algorithm for finding the next task, i.e., if
you want to return either a virgin or failed task first on a cache miss, then
you can keep an index into the array which represents the first failed or
virgin task. This approach won't be as fast as the first one, since this index
can move back and forth across the array, but it should still be better over
what we do today. In both options, the index will have to be potentially
updated whenever a task's status changes.
A more detailed approach requires maintaining separate lists for tasks with
different states: one for virgin tasks, one for failed, one for completed, and
one for running. As a task's status changes, it moves from one list to another.
These lists can be sorted (the list of virgin tasks should be sorted in
decreasing order of input size; the other lists can be sorted the same way, or
perhaps in order of when the tasks ran, which simplifies things). Picking the
next task is as simple as walking down the list (in many cases, just picking
the first element of a list). Care must be taken that moving an element from
one list to another is as effective as possible. The array of TIPS is no longer
needed. I can think of at least a couple of ways of doing this:
3. Modify a TIP object to have a reference to a 'previous' TIP object and a
'next' TIP object, so that we have linked lists of TIPS. Then a list is just a
reference to the first (and last) TIP object, and a TIP belongs to just one
list. If objects are added into the Running, Completed, and Failed lists at the
tail only, handling a TIP's status change is O(1), but the lists are no longer
sorted by size. Rather they are sorted by when the task ran. Finding the next
task is pretty much constant, as we either pick the first TIP in the Virgin
list, or the first TIP in the failed list (such that the TIP didn't fail on the
host), or we walk through the Running list to find the first speculative task.
If you need the lists sorted by size, then insertion is O(n). This is a fair
bit of change, as far as coding is concerned.
4. Lists can also be implemented as arrays of TIPS. This takes more space but
moving a TIP from one list to another is faster than in the linked-list case.
Insertion into a list sorted by size can be O(log n).
You can also have hybrid approaches where you just keep lists for virgin tasks,
for example.
I think it makes most sense to go with Option 1 for now, as it's the easiest to
implement and makes the most common case run much faster. Options 3 and 4 need
a fair bit of refactoring and may be an overkill for now, since you can get the
most bang for the buck by just making sure that you don't scan the array from
the beginning for virgin tasks.
> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
> Key: HADOOP-2119
> URL: https://issues.apache.org/jira/browse/HADOOP-2119
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Reporter: Runping Qi
> Assignee: Amar Kamat
> Priority: Critical
> Fix For: 0.17.0
>
> Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducer on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lacks behind on committing completed mapper tasks.
> The number of running mappers displayed on web UI getting bigger and bigger.
> The jos tracker eventually stopped responding to web UI.
> No progress is reported afterwards.
> Job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reach the heap
> space limit).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.