[ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568383#action_12568383 ]

Vivek Ratan commented on HADOOP-2119:
-------------------------------------

We need to clarify what assumptions are made by the various proposals, how 
they're different and similar, and figure out the best approach. After 
discussions between Owen, Arun and myself, we feel the following assumptions 
are valid:

- Failed tasks (tasks and TIPs are used interchangeably in this discussion) 
should be considered before virgin tasks where possible. 
- Failed tasks need not be sorted on input size. In fact, there isn't any one 
obvious way in which they should be sorted. 
- Similarly, there isn't any one obvious way in which running tasks should be 
sorted. Tasks that have been running the longest may be better candidates for 
speculation, but even that is not obvious. 

After looking at the various data structures that have been suggested, the 
following are recommended: 

1. A single data structure for virgin and failed tasks is recommended (call 
this the runnable list). There are two options:

1a. We use a structure similar to that for the cache: a hashmap from hostnames 
to linked lists, where each list contains the runnable tasks for that host (a 
sketch in Java follows this list). 
- Each list is initially sorted in decreasing order of input size, since all 
tasks start out virgin. 
- A task can occur in a list for more than one host. 
- When a task is selected to run, it is removed from the runnable list in 
O(1). The selected task is the first in the list, unless that task is already 
marked as running on another host, in which case it is dropped from the list 
and the next task is considered. 
- When a task fails, it can be re-inserted into a runnable list; it goes at 
the head of the list (O(1)). 
- A runnable list can therefore end up with failed tasks ahead of virgin 
tasks, the failed tasks ordered by failure time (most recently failed first) 
rather than by input size. The highest priority thus goes to the task that 
failed most recently. 
- This list can be singly linked, as we insert and delete at the head. 
- We can have a single hashmap that keeps track of the lists for hosts and 
racks. The key is a unique name for the host or rack; by using names formed 
like absolute file paths, the structure can support multiple levels of rack 
hierarchy. 
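
To make 1a concrete, here is a minimal sketch in Java. All names (Tip, 
RunnableCache) are illustrative stand-ins rather than the actual Hadoop 
classes, and a deque is used since every operation touches only the ends of 
the list:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Stand-in for Hadoop's TaskInProgress; only what this sketch needs. */
class Tip {
  volatile boolean running;
}

class RunnableCache {
  // One list per host or rack, keyed by a path-like name such as
  // "/rack1/host3", so multiple hierarchy levels share a single map.
  private final Map<String, Deque<Tip>> lists = new HashMap<>();

  /** Virgin tasks are appended at job init in decreasing input-size order. */
  void addVirgin(String location, Tip tip) {
    lists.computeIfAbsent(location, k -> new ArrayDeque<>()).addLast(tip);
  }

  /** A failed task goes to the head: the most recent failure runs first. O(1). */
  void addFailed(String location, Tip tip) {
    lists.computeIfAbsent(location, k -> new ArrayDeque<>()).addFirst(tip);
  }

  /**
   * Pop the first task not already running elsewhere. Tasks found running are
   * simply dropped here, since they also appear in other hosts' lists.
   */
  Tip pollRunnable(String location) {
    Deque<Tip> list = lists.get(location);
    while (list != null && !list.isEmpty()) {
      Tip tip = list.pollFirst();   // O(1) removal at the head
      if (!tip.running) return tip;
    }
    return null;
  }
}

Since insertion and removal happen only at the head (and the tail, during job 
initialization), a singly linked list would serve equally well.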

1b. Another implementation option is the 2-D sparse matrix Owen suggested 
earlier. The only difference between this structure and the one proposed in 1a 
is that tasks are additionally linked to the other tasks in their column. This 
makes it easy to delete all tasks that share the same data input: when a task 
is selected to run, all related tasks (those in the same column) can be 
removed very cheaply (see the sketch below). 
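
A minimal sketch of the column linkage, under the assumption (not from the 
original proposal) that column heads are kept in an array indexed by task ID; 
updating a host's row-head pointer when its first cell is unlinked is elided 
for brevity:

/** One cell of the hypothetical 2-D sparse matrix: row = host, column = task. */
class MatrixCell {
  MatrixCell rowPrev, rowNext; // doubly linked along the host's row
  MatrixCell colNext;          // singly linked down the task's column
}

class SparseTaskMatrix {
  // Head of each task's column, indexed by task ID for O(1) access.
  private final MatrixCell[] columnHead;

  SparseTaskMatrix(int numTasks) {
    columnHead = new MatrixCell[numTasks];
  }

  /**
   * When a task is scheduled, unlink every copy of it (one per host holding a
   * replica of its input) from its row. Cost is proportional to the
   * replication factor, i.e. effectively constant. Row-head bookkeeping is
   * omitted here.
   */
  void removeColumn(int taskId) {
    for (MatrixCell c = columnHead[taskId]; c != null; c = c.colNext) {
      if (c.rowPrev != null) c.rowPrev.rowNext = c.rowNext;
      if (c.rowNext != null) c.rowNext.rowPrev = c.rowPrev;
    }
    columnHead[taskId] = null;
  }
}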

2. For running tasks (call this the running list), we have the following 
options: 

2a. Keep a global list of running tasks and walk through it to pick a 
speculative task, much as we do today (the only difference being that 
currently, we walk through an array of all tasks, whereas with this approach, 
we walk through a list of only running tasks). 
- When a task starts to run, it is moved from the runnable list to the running 
list; it can be inserted at the head or the tail. This is O(1). 
- When a task completes (or fails), we need to locate it in this list and 
remove it. 
- Perhaps the best way to implement this is for each TIP object to carry prev 
and next pointers. Locating a task in the list is then O(1), and removal is 
O(1) too since the list is doubly linked (see the sketch after this list). 
- To pick a speculative task, we walk through this list as before and take the 
first task that qualifies (we have to add rack-awareness options to the 
current algorithm). This is not O(1), but it is no worse than what we have 
today. 
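
A sketch of 2a under these assumptions: RunningTip stands in for the real TIP 
class, and canSpeculateOn is a hypothetical placeholder for the current 
speculation checks plus rack awareness:

/** TIP stand-in for the 2a sketch, with embedded list pointers. */
class RunningTip {
  RunningTip prev, next;

  /** Hypothetical predicate: current speculation checks plus rack awareness. */
  boolean canSpeculateOn(String host) {
    return true; // placeholder
  }
}

class RunningList {
  private RunningTip head;

  /** Insert at the head when a task starts running. O(1). */
  void add(RunningTip tip) {
    tip.prev = null;
    tip.next = head;
    if (head != null) head.prev = tip;
    head = tip;
  }

  /** Remove on completion or failure. O(1): the TIP carries its own links. */
  void remove(RunningTip tip) {
    if (tip.prev != null) tip.prev.next = tip.next; else head = tip.next;
    if (tip.next != null) tip.next.prev = tip.prev;
    tip.prev = tip.next = null;
  }

  /** Linear walk, but over running tasks only, not the whole task array. */
  RunningTip findSpeculative(String host) {
    for (RunningTip t = head; t != null; t = t.next) {
      if (t.canSpeculateOn(host)) return t;
    }
    return null;
  }
}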

2b. Use a 2-D sparse matrix, as described earlier, to represent possible 
speculative tasks. 
- When a task is selected to run from the runnable list, copies of it are 
inserted into the running list for each host where the task could be 
speculated (at the very least, one per host holding a replica of its input). 
Insertion into the matrix is O(1) given doubly linked lists for the rows and 
an array, indexed by task ID, pointing to the head of each column. 
- When a task completes (or fails), it is removed from this list, as are all 
tasks linked to it in its column. This takes constant time. 
- Finding a speculative task is O(1): pick the first task in the host's 
running list, or in its rack's. If neither matches, pick a task from an 
arbitrary host (see the sketch below). 
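
A sketch of that lookup order, reusing the MatrixCell type from the 1b sketch 
above; the rowHead map is keyed by path-like host and rack names as in 1a, and 
rows that empty out are assumed to be removed from the map:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class SpeculativePicker {
  // Head cell of each host's or rack's running-list row.
  private final Map<String, MatrixCell> rowHead = new HashMap<>();

  /** O(1) choice: data-local first, then rack-local, then any running task. */
  MatrixCell pick(String host, String rack) {
    MatrixCell c = rowHead.get(host);
    if (c == null) c = rowHead.get(rack);
    if (c == null) {
      Iterator<MatrixCell> it = rowHead.values().iterator();
      if (it.hasNext()) c = it.next();
    }
    return c;
  }
}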

As far as implementation recommendations go, it makes sense to implement 1a 
and 2a now. Both build on existing code and require relatively few changes. 
Picking a speculative task will be slow and not locality-aware, but it will be 
no worse than what we have today, and implementation will be quick. At some 
point, depending on need, we can implement 1b and 2b, both of which describe 
the same data structure. That requires more coding and testing, but will make 
finding a speculative task faster. 


> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress was reported afterwards.
> The job tracker runs on a separate node.
> The job tracker process consumed 100% CPU, with VM size 1.01g (reaching the 
> heap space limit).
