[
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579343#action_12579343
]
Amar Kamat commented on HADOOP-2119:
------------------------------------
The attached patch does the following:
Maps :
1) Replaces {{ArrayList}} with {{LinkedList}} for the currently used caches
(call them the *NR* caches).
2) Failed TIPs are added (when possible) at the front of the *NR* caches. [for
fail-early]
3) Removal of a TIP from the *NR* caches is on demand, i.e. running/non-runnable
TIPs are removed while searching for a new TIP.
4) Maintains a new set of caches, the *R* caches, for running TIPs. These
caches are similar to the *NR* caches but provide faster removal. Additions to
the caches are appends. Removal is one shot, i.e. a non-running TIP is removed
from all the *R* caches at once. [for speculation] (See the cache sketch after
the Reduces list below.)
Reduces :
1) Maintains a {{LinkedList}} of non-running reducers, i.e. the *NR* cache. [for
non-running tasks]
2) Failed reducers are added to the front of the *NR* cache. [for fail-early]
3) Maintains a set of running reducers with faster removal. [for
speculation]
----
Also,
1) Search preference is as follows: {{FAILED}}, {{NON-RUNNING}}, {{RUNNING}}.
2) Search order is as follows (a lookup sketch follows at the end of this list):
{noformat}
1. Search the local cache, i.e. strong locality.
2. Search bottom-up (i.e. from the node's parent to the node's top-level
ancestor) for a TIP, i.e. weak locality.
3. Search breadth-wise across top-level ancestors for a TIP, i.e. for a
non-local TIP.
{noformat}
3) Introduces a _default-node_. TIPs that are not local to any node are treated
as local to the _default-node_. This node takes care of random-writer-like
cases, i.e. it adapts jobs with no data locality to the cache structure. The
_default-node_ belongs to the _default-rack_, and hence all the nodes share the
non-local TIPs through the _default-rack_.
4) The JobTracker need not be synchronized for providing reports to the
JobClient, hence these APIs don't lock the JT. Some staleness is okay (see the
snapshot sketch at the end of this list).
5) Commits are now done in batches of a fixed number of tasks at a time
(default 5000), so at most 5000 tasks are batch-committed in one go. The reason
for this fixed-size batching is that committing too many TIPs at once locks the
JobTracker for a very long time, causing *lost rpc/tracker* issues. (A batching
sketch follows at the end of this list.)
6) TIPs use the tracker's hostname instead of the tracker name for maintaining
the list of machines where the TIP failed.
7) One major bottleneck we observed was in {{JobInProgress.isJobComplete()}},
where all the TIPs were scanned. This is expensive because {{isJobComplete()}}
is called once for every completed/failed task (via the {{TaskCommit}} thread),
which adds up for jobs with a large number of maps. Now this check uses the
counts of finished TIPs (sketched at the end of this list).
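For (2) and (3), here is a rough sketch of the lookup order, reusing the
{{NonRunningCache}}/{{TIP}} stand-ins from the earlier sketch. The {{Node}}
interface below is hypothetical and not the patch's actual API.
{noformat}
import java.util.Collection;

// Sketch only: locality-aware search from a tracker's node outwards.
class LocalitySearch {
  interface Node {
    Node getParent();                  // null once past the top-level ancestor
    NonRunningCache getCache();        // NR cache attached to this node/rack
  }

  NonRunningCache.TIP findTaskFor(Node trackerNode,
                                  Collection<Node> topLevelNodes) {
    // 1. Strong locality: the tracker's own node.
    NonRunningCache.TIP tip = trackerNode.getCache().findNextRunnable();
    if (tip != null) return tip;

    // 2. Weak locality: walk bottom-up from the node's parent to its
    //    top-level ancestor (rack and above).
    for (Node n = trackerNode.getParent(); n != null; n = n.getParent()) {
      tip = n.getCache().findNextRunnable();
      if (tip != null) return tip;
    }

    // 3. Non-local: search breadth-wise across all top-level ancestors.
    //    TIPs local to no real node hang off default-node/default-rack, so
    //    every tracker picks them up in this pass (the random-writer case).
    for (Node top : topLevelNodes) {
      tip = top.getCache().findNextRunnable();
      if (tip != null) return tip;
    }
    return null;
  }
}
{noformat}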
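For (4), one way to picture the lock-free reporting is a volatile snapshot that
the report RPCs read without taking the JobTracker lock. This only illustrates
the "some staleness is okay" idea; it is not necessarily how the patch
implements it, and the names are made up.
{noformat}
// Sketch only: reporting without holding the JobTracker lock.
class JobStatusHolder {
  // Immutable snapshot of whatever the JobClient asks for.
  static final class Snapshot {
    final float mapProgress;
    final float reduceProgress;
    Snapshot(float mapProgress, float reduceProgress) {
      this.mapProgress = mapProgress;
      this.reduceProgress = reduceProgress;
    }
  }

  // volatile gives safe publication; readers may see a slightly stale
  // snapshot, which is fine for status reports.
  private volatile Snapshot latest = new Snapshot(0f, 0f);

  // Called from the (locked) scheduling/commit paths.
  void publish(float mapProgress, float reduceProgress) {
    latest = new Snapshot(mapProgress, reduceProgress);
  }

  // Called from the report RPCs; no JobTracker lock is taken.
  Snapshot report() {
    return latest;
  }
}
{noformat}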
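For (5), the fixed-size batching can be sketched as draining at most 5000
pending commits per lock acquisition. All names here are hypothetical.
{noformat}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch only: commit completed tasks in fixed-size batches so the
// JobTracker lock is never held across one huge commit.
class BatchCommitter {
  interface Task { void commit(); }

  static final int MAX_BATCH = 5000;           // default batch size

  private final Object jobTrackerLock = new Object();
  private final Queue<Task> pendingCommits = new ConcurrentLinkedQueue<Task>();

  void enqueue(Task t) {
    pendingCommits.offer(t);
  }

  void commitPending() {
    boolean more = true;
    while (more) {
      synchronized (jobTrackerLock) {
        // Commit at most MAX_BATCH tasks per lock acquisition.
        int committed = 0;
        Task t;
        while (committed < MAX_BATCH && (t = pendingCommits.poll()) != null) {
          t.commit();
          committed++;
        }
        more = !pendingCommits.isEmpty();
      }
      // Lock released here, so heartbeats and client RPCs get a chance
      // to run before the next batch is committed.
    }
  }
}
{noformat}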
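And for (7), the scan over all TIPs is replaced by counts of finished TIPs;
roughly as below, with hypothetical fields rather than the actual
{{JobInProgress}} code.
{noformat}
// Sketch only: O(1) job-completion check from running counts of finished
// TIPs instead of scanning every TIP on each completed/failed task.
class CompletionCheck {
  private final int numMapTasks;
  private final int numReduceTasks;
  private int finishedMaps;        // bumped when a map TIP completes
  private int finishedReduces;     // bumped when a reduce TIP completes
  private boolean failed;          // set when the job is failed/killed

  CompletionCheck(int numMapTasks, int numReduceTasks) {
    this.numMapTasks = numMapTasks;
    this.numReduceTasks = numReduceTasks;
  }

  synchronized void mapFinished()    { finishedMaps++; }
  synchronized void reduceFinished() { finishedReduces++; }
  synchronized void jobFailed()      { failed = true; }

  // Previously isJobComplete() walked every TIP on each call from the
  // TaskCommit thread; with counts it is a constant-time comparison.
  synchronized boolean isJobComplete() {
    return failed
        || (finishedMaps == numMapTasks && finishedReduces == numReduceTasks);
  }
}
{noformat}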
> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
> Key: HADOOP-2119
> URL: https://issues.apache.org/jira/browse/HADOOP-2119
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Reporter: Runping Qi
> Assignee: Amar Kamat
> Priority: Critical
> Fix For: 0.17.0
>
> Attachments: HADOOP-2119-v4.1.patch, hadoop-2119.patch,
> hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lags behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and
> bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress is reported afterwards.
> Job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reaching the
> heap space limit).