[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Vivek Ratan (JIRA) Wed, 06 Feb 2008 22:57:31 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566472#action_12566472
 ]


Vivek Ratan commented on HADOOP-2119:
-------------------------------------


>> This is unacceptable. We used to have this behavior and it was very bad. If 
>> map 0 fails deterministically, you absolutely do not want all 200,000 maps 
>> to fail 4 times each before the job fails.

That's a valid point. However, today's code doesn't handle this situation 
well either. If you treat a virgin task and a failed task with the same 
priority, it's quite possible that you run a number of virgin tasks before 
you get to a failed task. You should really prioritize failed tasks over 
virgin tasks if you want to avoid the situation you've mentioned, though 
such a policy may cause unfairness in other situations. 
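
To make the trade-off concrete, here is a rough single-slot simulation. It is 
only a sketch: the class and method names are made up for illustration, it 
assumes task 0 fails on every attempt while all other tasks succeed, and it 
models the worst case of equal priority (every virgin task happens to be 
picked before the retry). It is not taken from the JobTracker code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-slot simulation; not taken from the JobTracker code. */
class RetryOrderDemo {
    /**
     * Counts task launches until the job is declared failed, assuming task 0
     * fails on every attempt and every other task succeeds on its first attempt.
     */
    static int launchesUntilJobFails(int numTasks, int maxAttempts, boolean failedFirst) {
        Deque<Integer> virgin = new ArrayDeque<>();
        for (int i = 0; i < numTasks; i++) {
            virgin.add(i);
        }
        Deque<Integer> failed = new ArrayDeque<>();
        int attemptsOfTask0 = 0;
        int launches = 0;
        while (!virgin.isEmpty() || !failed.isEmpty()) {
            // Retry a failed task now if the policy says so, or if nothing else is left.
            boolean takeFailed = !failed.isEmpty() && (failedFirst || virgin.isEmpty());
            int task = takeFailed ? failed.poll() : virgin.poll();
            launches++;
            if (task == 0) {
                attemptsOfTask0++;
                if (attemptsOfTask0 >= maxAttempts) {
                    return launches; // the job fails here
                }
                failed.add(task);
            }
        }
        return launches; // reached only when numTasks is 0
    }
}
```

With 1000 tasks and 4 attempts per task, retrying failed tasks first fails the 
job after 4 launches, while the worst case without that priority takes 1003 
launches (all 1000 first attempts, then the 3 remaining retries of task 0).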

I think what we really need is what I mentioned for Options 3 and 4 - separate 
data structures to hold virgin tasks, failed tasks, and running tasks. 
- The data structure for virgin tasks does not need dynamic insertion 
capabilities (i.e., once created, you will never add any task to it). It also 
needs to be sorted by size. This structure needs to support deletion (once a 
task runs, it will be removed from this structure), and it also needs to handle 
large sizes (up to the number of tasks). 
- The data structure for failed tasks needs to support quick insertions and 
deletions, but will be smaller in size (as the number of failed tasks is 
typically low). 
- The data structure for running tasks needs to support quick insertions and 
deletions, and its size is bounded (by the number of task trackers), though 
not necessarily small. Note that when a task is running and its status changes, 
we'll need to find it quickly in the running list so we can move it to 
another structure. The data structure should optimize lookups as much as 
possible. 

Given all this, Devaraj, Amar, and I feel that the following data structure 
choices are the most appropriate: 
- We leave the cache as is.
- For virgin tasks, we keep the tasks array, and have a running index into it 
(as suggested in Option 1). Alternatively, we could have a doubly linked list 
structure with some sort of indexing (a list implemented as an array). 
- For failed tasks, we can keep a linked list or a sorted structure such as a 
Java sorted set. If we don't need to keep failed tasks sorted by size 
(we could order them by when they ran), then we can just keep a linked list and 
insert at the end. 
- For running tasks, given that the structure size is potentially larger than 
that for failed tasks, we can still keep a sorted map, or have some sort of a 
composite structure (which Devaraj/Amar have discussed, and can describe). 
Again, if we don't need to keep running tasks sorted by size (we could 
order them by when they ran), then we can just keep a linked list and insert at 
the end. 
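
As a rough illustration of the choices above, here is a minimal sketch of the 
three structures side by side. All names (TaskPools, Task, obtainNext, 
markFailed, markSucceeded) are hypothetical and do not come from the patch. 
It assumes failed tasks are retried in arrival order rather than by size, and 
it uses an insertion-ordered LinkedHashMap for running tasks as one possible 
way to get "insert at the end" together with a fast lookup by task id; the 
composite structure Devaraj/Amar have in mind may well differ.

```java
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Hadoop source. */
class TaskPools {
    /** Minimal stand-in for a task: an id and an input size. */
    static final class Task {
        final int id;
        final long size;
        Task(int id, long size) { this.id = id; this.size = size; }
    }

    // Virgin tasks: fixed array sorted largest-first, consumed through a running index.
    private final Task[] virgin;
    private int nextVirgin = 0;

    // Failed tasks: linked list used as a FIFO queue (insert at the end).
    private final Deque<Task> failed = new LinkedList<>();

    // Running tasks: insertion-ordered map, so a status change can find a task by id.
    private final Map<Integer, Task> running = new LinkedHashMap<>();

    TaskPools(Task[] tasks) {
        virgin = tasks.clone();
        Arrays.sort(virgin, (a, b) -> Long.compare(b.size, a.size));
    }

    /** Picks the next task to launch: failed tasks first, then virgin ones; null if none. */
    Task obtainNext() {
        Task t = failed.poll();
        if (t == null && nextVirgin < virgin.length) {
            t = virgin[nextVirgin++]; // "deleting" a virgin task is just advancing the index
        }
        if (t != null) {
            running.put(t.id, t);
        }
        return t;
    }

    /** Moves a running task to the failed queue; returns false if it was not running. */
    boolean markFailed(int id) {
        Task t = running.remove(id);
        if (t == null) {
            return false;
        }
        failed.add(t);
        return true;
    }

    /** Drops a running task that finished successfully; returns false if it was not running. */
    boolean markSucceeded(int id) {
        return running.remove(id) != null;
    }

    int runningCount() { return running.size(); }
}
```

The virgin array is never modified after construction, which matches the "no 
dynamic insertion" requirement. The extra memory is one queue node per failed 
task and one map entry per running task.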

By having multiple lists and keeping the array for virgin tasks, we're using 
more memory, but if that becomes a problem, we can use some other structure to 
represent virgin tasks. 


> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The job tracker lags behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI keeps getting bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress is reported afterwards.
> The job tracker is running on a separate node.
> The job tracker process consumed 100% CPU, with a VM size of 1.01g (reaching 
> the heap space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
