[ 
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566782#action_12566782
 ] 

Devaraj Das commented on HADOOP-2119:
-------------------------------------

I think we should prioritize FAILED tasks over VIRGIN tasks when we miss the 
task-cache. That way Owen's concern will be addressed. Regarding options 
(5) and (6), one thing to note is that a task should be removed from the 
running-tasks data structure as soon as it reaches the COMMIT_PENDING state. 
This will ensure that the running-tasks data structure doesn't grow 
indefinitely (since the JT would handle COMMIT_PENDING tasks in the 
background). 
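
To make the two points above concrete, here is a rough sketch of what I have 
in mind. It is only an illustration: the class, the method names, and the use 
of plain string task ids are all made up for this comment and are not taken 
from the JobTracker code or from the attached patch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Hadoop.
class TaskPickerSketch {
    private final Deque<String> failedTasks = new ArrayDeque<>();
    private final Deque<String> virginTasks = new ArrayDeque<>();
    private final Set<String> runningTasks = new LinkedHashSet<>();

    // A task that has never been scheduled.
    void addVirgin(String taskId) {
        virginTasks.addLast(taskId);
    }

    // A running task failed: it leaves the running set and becomes
    // eligible for re-execution.
    void markFailed(String taskId) {
        if (runningTasks.remove(taskId)) {
            failedTasks.addLast(taskId);
        }
    }

    // Called when the task-cache (locality lookup) had nothing to offer.
    // FAILED tasks are tried before VIRGIN tasks.
    String pickOnCacheMiss() {
        String taskId = failedTasks.pollFirst();
        if (taskId == null) {
            taskId = virginTasks.pollFirst();
        }
        if (taskId != null) {
            runningTasks.add(taskId);
        }
        return taskId; // null when nothing is left to schedule
    }

    // The task reached COMMIT_PENDING: drop it from the running set right
    // away so the set stays bounded by the tasks that are actually running.
    void markCommitPending(String taskId) {
        runningTasks.remove(taskId);
    }

    int runningCount() {
        return runningTasks.size();
    }
}
```

With this shape, a task that fails is handed out again ahead of any task that 
has not been tried yet, and the running set shrinks as soon as a task reaches 
COMMIT_PENDING rather than when the commit completes.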

Also, do we care whether speculative tasks are executed in the order of split 
sizes?

Overall, I think (1) + (3) + (5) looks like an approach worth trying out and 
benchmarking. The other thing that might help is to defer deletes from the 
data structure in (5) until we do a scan looking for speculative tasks (batch 
deletes). In general, the percentage of speculative tasks is very small, so 
we might hit the O(n) worst case for the scan towards the end of the 
map/reduce phases. But it should be okay to have slightly degraded 
performance when looking for speculative tasks if the most frequent 
operations (looking for virgin/failed tasks) are efficient. Thoughts?
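
A rough sketch of the batch-delete idea, to show the trade-off. This is an 
illustration only: the names are invented, the running tasks are assumed to 
sit in an ordered list where removing one element is not cheap, and the actual 
test for whether a task deserves a speculative attempt is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Hadoop.
class BatchDeleteSketch {
    // Assumption: an ordered structure where a single removal costs O(n).
    private List<String> running = new ArrayList<>();
    // Tasks that reached COMMIT_PENDING but are not yet purged.
    private final Set<String> commitPending = new HashSet<>();

    void start(String taskId) {
        running.add(taskId);
    }

    // Cheap: only record that the task should go away.
    void markCommitPending(String taskId) {
        commitPending.add(taskId);
    }

    // One O(n) pass that purges all recorded tasks and returns the tasks
    // that are still running, i.e. the ones worth examining for speculation.
    List<String> scanForSpeculation() {
        List<String> survivors = new ArrayList<>(running.size());
        for (String taskId : running) {
            if (!commitPending.contains(taskId)) {
                survivors.add(taskId);
            }
        }
        running = survivors;
        commitPending.clear();
        return new ArrayList<>(survivors);
    }

    // Number of entries physically stored, including not-yet-purged ones.
    int storedCount() {
        return running.size();
    }
}
```

The cost is that the structure temporarily holds entries for tasks that are 
already COMMIT_PENDING, so it is only bounded if the scan runs often enough.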

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lags behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress is reported afterwards.
> The job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reaching the heap 
> space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
