[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Owen O'Malley (JIRA) Thu, 07 Feb 2008 19:43:33 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566901#action_12566901 ]


Owen O'Malley commented on HADOOP-2119:
---------------------------------------

I think we can do better than that, by using a special data structure that 
isn't that complicated.

I propose that we use a 2-d sparse matrix, where each row is a location (node 
or rack) and each column corresponds to a task in progress (TIP) that is 
currently runnable, but not running. I'd make each row a doubly linked circular 
list and each column a singly linked circular list. So let's say the operations 
are:

{code}
interface LocationTable {
  // add to the front of the lists for all of the locations
  public void addToFront(TaskInProgress tip, String[] locations);
  // add it to the back of the lists at all of the locations
  public void addToBack(TaskInProgress tip, String[] locations);
  // get the first task in the given location and remove it from all of the
  // lists
  public TaskInProgress getFront(String location);
}
{code}

All of the operations involve doing a lookup to find the list and an O(1) 
operation on each list they modify. *Doing deletes out of a doubly linked 
list is very fast.* If we use a hash table from the location name to the front 
of the list for that location, then the lookup is also O(1).
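
To make the shape of the structure concrete, here is a minimal sketch of one way it could be laid out. It is only an illustration, not the patch: the names LocationTableSketch, Cell and nextInTip are made up, a generic type T stands in for TaskInProgress, and each row uses a sentinel head cell to keep the splicing simple.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the sparse-matrix table; T stands in for TaskInProgress. */
class LocationTableSketch<T> {
  /** One matrix cell: a task's entry in one location's list. */
  private static final class Cell<T> {
    T tip;               // null for a row's sentinel head
    Cell<T> prev, next;  // row links: doubly linked circular list
    Cell<T> nextInTip;   // column link: singly linked circular list
  }

  // hash table from location name to the sentinel head of its row
  private final Map<String, Cell<T>> rows = new HashMap<>();

  private Cell<T> row(String location) {
    Cell<T> head = rows.get(location);
    if (head == null) {
      head = new Cell<>();
      head.prev = head;
      head.next = head;
      rows.put(location, head);
    }
    return head;
  }

  public void addToFront(T tip, String[] locations) {
    add(tip, locations, true);
  }

  public void addToBack(T tip, String[] locations) {
    add(tip, locations, false);
  }

  private void add(T tip, String[] locations, boolean front) {
    Cell<T> first = null;
    Cell<T> last = null;
    for (String location : locations) {
      Cell<T> head = row(location);
      Cell<T> cell = new Cell<>();
      cell.tip = tip;
      // splice right after the sentinel (front) or right before it (back)
      Cell<T> before = front ? head : head.prev;
      cell.prev = before;
      cell.next = before.next;
      before.next.prev = cell;
      before.next = cell;
      // thread the cell onto the task's column
      if (first == null) {
        first = cell;
      } else {
        last.nextInTip = cell;
      }
      last = cell;
    }
    if (last != null) {
      last.nextInTip = first;  // close the column into a ring
    }
  }

  /** Returns the first task at the location, or null if there is none. */
  public T getFront(String location) {
    Cell<T> head = rows.get(location);
    if (head == null || head.next == head) {
      return null;
    }
    Cell<T> start = head.next;
    Cell<T> cell = start;
    do {  // walk the column, unlinking the task from every row in O(1) each
      cell.prev.next = cell.next;
      cell.next.prev = cell.prev;
      cell = cell.nextInTip;
    } while (cell != start);
    return start.tip;
  }
}
```

The column ring is what lets getFront remove the task from its other lists without searching them. In this sketch, a task added with an empty locations array is not stored anywhere, so a real version would need to handle that case.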

I think we should solve HADOOP-2014 at the same time:
http://issues.apache.org/jira/browse/HADOOP-2014?focusedCommentId=12566814#action_12566814

So the order would be:
  1. Look at the node local list O(1)
  2. Look at the rack local list O(1)
  3. Look at the most overloaded rack from HADOOP-2014 O(# racks)

Between the 3 of them, you'll always find a task if there are any to run. 
Update for all of the lists is O(1), regardless of how you found it.
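
As a rough illustration of that order, here is a small sketch. It is not the linked-list table itself: it uses java.util.LinkedHashSet as a stand-in, since that also gives O(1) removal and O(1) access to the oldest entry. The names PickOrderSketch, add and pick are hypothetical, tasks are plain strings, and "most overloaded rack" is assumed here to mean the rack with the most pending tasks, which may not match what HADOOP-2014 ends up doing.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of the three-step lookup; tasks are plain strings here. */
class PickOrderSketch {
  private final Map<String, LinkedHashSet<String>> nodeLists = new HashMap<>();
  private final Map<String, LinkedHashSet<String>> rackLists = new HashMap<>();
  // where each pending task is listed, so it can be removed everywhere
  private final Map<String, String[]> nodesOf = new HashMap<>();
  private final Map<String, String[]> racksOf = new HashMap<>();

  void add(String task, String[] nodes, String[] racks) {
    nodesOf.put(task, nodes);
    racksOf.put(task, racks);
    for (String n : nodes) {
      nodeLists.computeIfAbsent(n, k -> new LinkedHashSet<>()).add(task);
    }
    for (String r : racks) {
      rackLists.computeIfAbsent(r, k -> new LinkedHashSet<>()).add(task);
    }
  }

  /** Picks a task for a tracker on the given node and rack, or null. */
  String pick(String node, String rack) {
    String task = first(nodeLists.get(node));    // 1. node local, O(1)
    if (task == null) {
      task = first(rackLists.get(rack));         // 2. rack local, O(1)
    }
    if (task == null) {                          // 3. scan racks, O(# racks)
      LinkedHashSet<String> fullest = null;
      for (LinkedHashSet<String> list : rackLists.values()) {
        if (fullest == null || list.size() > fullest.size()) {
          fullest = list;
        }
      }
      task = first(fullest);
    }
    if (task != null) {
      remove(task);  // same update cost no matter which step found it
    }
    return task;
  }

  private static String first(Set<String> list) {
    return (list == null || list.isEmpty()) ? null : list.iterator().next();
  }

  private void remove(String task) {
    for (String n : nodesOf.remove(task)) {
      nodeLists.get(n).remove(task);
    }
    for (String r : racksOf.remove(task)) {
      rackLists.get(r).remove(task);
    }
  }
}
```

Step 3 only finds every remaining task if each task is listed under at least one rack, which this sketch assumes.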

When tasks fail, you put them back at the front of all of the relevant lists.

Which leaves the question of speculative execution... I suspect a LocationTable 
with the currently running tasks would work pretty well.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and 
> bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress was reported afterwards.
> The job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reaching the 
> heap space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
