[
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568383#action_12568383
]
Vivek Ratan commented on HADOOP-2119:
-------------------------------------
We need to clarify what assumptions are made by the various proposals, how
they're different and similar, and figure out the best approach. After
discussions between Owen, Arun and myself, we feel the following assumptions
are valid:
- Failed tasks (tasks and TIPs are used interchangeably in this discussion)
should be considered before virgin tasks where possible.
- Failed tasks need not be sorted on input size. In fact, there isn't any one
obvious way in which they should be sorted.
- Similarly, there isn't any one obvious way in which running tasks should be
sorted. Maybe the tasks that have run the longest are better candidates for
speculation, but that's not obvious.
After looking at the various data structures that have been suggested, the
following are recommended:
1. A single data structure for virgin and failed tasks is recommended (call
this the runnable list). There are two options:
1a. We use a structure similar to that for the cache. Have a hashmap of
hostnames to linked lists, where each list contains the runnable tasks for the
host.
- Each list is sorted in decreasing order of input size at the beginning, as
all tasks are virgin.
- A task can occur in a list for more than one host.
- When a task is selected to run (this is the first task in the list, unless
that task is marked as running on another host, in which case it is removed
from the list and the next task is considered), it is removed from the runnable
list. This is an O(1) operation.
- A task can be inserted into a runnable list when it fails, in which case it
is inserted into the beginning of the list (O(1)).
- A runnable list can end up such that failed tasks occur before virgin tasks,
and the failed tasks are sorted by when they failed (the ones that failed later
occur earlier in the list), and not by input size. Thus the highest priority is
for a failed task that failed most recently.
- This list can be singly linked, as we insert and delete at the head.
- We can have a single hashmap that keeps tracks of lists for hosts or racks.
The key for the hashmap is a unique name of the host or rack. By using names
similar to absolute paths of files in directories, this structure can support
multiple hierarchies of racks.
1b. Another implementation option is a 2-D sparse matrix as Owen suggested
earlier. Only difference between this structure and the one proposed in 1a is
that tasks are additionally connected to other tasks in their column. This
makes it easy to delete all tasks related to the same data input. When a task
is selected to run, we can remove all other related tasks (tasks in the same
column) very easily.
2. For running tasks (call this the running list), we have the following
options:
2a. Keep a global list of running tasks and walk through it to pick a
speculative task, much as we do today (the only difference being that
currently, we walk through an array of all tasks, whereas with this approach,
we walk through a list of only running tasks).
- When a tasks starts to run, it is removed from the runnable list to a running
list, and can be inserted at the head or tail. This is O(1).
- When a task completes (or fails), we need to locate it in this list and
remove it.
- Perhaps the best way to implement this is for each TIP object to have a prev
and next pointer. That makes location of a task in this list O(1), and removal
is O(1) too as the list is doubly linked.
- To figure out a speculative task, we walk through this list as before and
pick the first task that works (we have to add options for rack awareness to
the current algorithm). This is not O(1), but is no worse off than what we have
today.
2b. Use a 2-D sparse matrix, as described earlier, to represent possible
speculative tasks.
- When a task is selected to run from the runnable list, copies of it are
inserted into the running list for each host where the task can be speculated
(one for each host where the input is replicated, at the very least). Insertion
into the matrix is O(1) if you have doubly linked lists for rows, and if you
have an array, indexed by task ID, that points to the head of the column.
- When a task completes (or fails), it is removed from this list, as are all
tasks linked to it in its column. This takes constant time.
- Finding a speculative task is O(1), as you pick the first task in the running
list for the host, or for its rack. If nothing matches, pick a task from any
arbitrary host.
As far as implementation recommendations go, it makes sense to implement 1a and
2a now. Both use existing code and do not have a lot of changes to make.
Picking a speculative task will be slow and not locality aware, but it will be
no worse than what we have today, and implementation will be quick. At some
time, depending on need, we can implement 1b and 2b, both of which refer to the
same data structure. This requires more coding and testing, but will make the
process of finding a speculative task faster.
> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
> Key: HADOOP-2119
> URL: https://issues.apache.org/jira/browse/HADOOP-2119
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Reporter: Runping Qi
> Assignee: Amar Kamat
> Priority: Critical
> Fix For: 0.17.0
>
> Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducer on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lacks behind on committing completed mapper tasks.
> The number of running mappers displayed on web UI getting bigger and bigger.
> The jos tracker eventually stopped responding to web UI.
> No progress is reported afterwards.
> Job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reach the heap
> space limit).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.