[ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568621#action_12568621 ]

Vivek Ratan commented on HADOOP-2119:
-------------------------------------

The only differences between 1a (we need to give that a name - let's call it the 
'cache map', as it's mostly based on how the cache is implemented) and a 
sparse matrix are that in the latter, each TIP is linked to the TIP below it in 
the same column, and that the linked list for a row in a sparse matrix is doubly 
linked (you need that to efficiently delete tasks from the running list), while 
in 1a the runnable list is a singly linked list. Given that, I would vote for 
the cache map, for the following reasons (a rough sketch of both structures 
follows the list): 
- My big concern with implementing a sparse matrix now is that you'd be 
implementing a brand new data structure. Given how core this functionality is, 
and the time constraints, it's riskier to introduce that much new code. 
- You already have most of the code in place for the cache map; it's been there 
for a while and has been tested in production. That gives me a lot more comfort 
than putting in brand new code. 
- In terms of performance, the only difference between a linked list for 
running tasks (2a) and a sparse matrix is for speculative tasks, where the 
latter performs better. However, it's not clear to me how much this will show 
up in overall performance. It seems like the cost of a linked list may be 
extremely minimal in the overall scheme of things, so why throw in new code? 
It's better, IMO, to see whether this performance difference is indeed 
significant before making big changes. 
- As I mentioned earlier, the cache map and the sparse matrix are almost 
identical. I don't see a sparse matrix being any simpler or more elegant than a 
cache map, i.e., I see both as fairly simple and elegant structures. 
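
To make the comparison concrete, here is a rough, self-contained sketch of the 
two shapes being discussed. The names (Tip, CacheMap, SparseMatrix, 
runnableByHost, etc.) are made up for illustration and are not the actual 
JobTracker classes or the code in the patch; a real implementation would key 
rows off locality (host/rack) and track TIP state rather than these toy types. 

{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Simplified, self-contained stand-ins for the two structures being compared.
// The names are illustrative only and do not correspond to the actual
// JobTracker classes or the attached patch.
public class SchedulerStructuresSketch {

    /** Placeholder for a TaskInProgress. */
    static final class Tip {
        final int id;
        Tip(int id) { this.id = id; }
    }

    /**
     * Option 1a ("cache map"): runnable TIPs bucketed by a locality key such
     * as host or rack; each bucket behaves like a singly linked queue.
     */
    static final class CacheMap {
        private final Map<String, Queue<Tip>> runnableByHost = new HashMap<>();

        void addRunnable(String host, Tip tip) {
            runnableByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(tip);
        }

        /** Next runnable TIP for a tracker on this host, or null if none. */
        Tip nextRunnable(String host) {
            Queue<Tip> q = runnableByHost.get(host);
            return (q == null) ? null : q.poll();
        }
    }

    /**
     * Sparse-matrix idea: one row per locality key, plus "column" links that
     * tie together every row entry for the same TIP. Rows would be doubly
     * linked in a real implementation so a running task can be unlinked
     * cheaply; LinkedList stands in for that here.
     */
    static final class SparseMatrix {
        private final Map<String, LinkedList<Tip>> rows = new HashMap<>();
        private final Map<Tip, List<String>> columns = new HashMap<>();

        void place(String host, Tip tip) {
            rows.computeIfAbsent(host, h -> new LinkedList<>()).add(tip);
            columns.computeIfAbsent(tip, t -> new LinkedList<>()).add(host);
        }

        /**
         * When one attempt succeeds (e.g. the speculative copy), the column
         * links let us remove the TIP from every row it was placed in.
         */
        void retire(Tip tip) {
            List<String> hosts = columns.remove(tip);
            if (hosts == null) return;
            for (String host : hosts) {
                rows.get(host).remove(tip);
            }
        }

        /** Number of rows still holding this TIP (for illustration only). */
        int rowsHolding(Tip tip) {
            int n = 0;
            for (LinkedList<Tip> row : rows.values()) {
                if (row.contains(tip)) n++;
            }
            return n;
        }
    }

    public static void main(String[] args) {
        CacheMap cache = new CacheMap();
        cache.addRunnable("host1", new Tip(1));
        System.out.println("cache map next: " + cache.nextRunnable("host1").id);

        SparseMatrix matrix = new SparseMatrix();
        Tip t = new Tip(2);
        matrix.place("host1", t);
        matrix.place("rack1", t);   // the same TIP appears in two rows
        matrix.retire(t);           // one success retires both placements
        System.out.println("rows still holding the TIP: " + matrix.rowsHolding(t));
    }
}
{code}

The point of the column links in the sketch is the speculative case: when one 
attempt of a TIP succeeds, its column lets you retire every other placement 
without scanning each row, which is exactly where the sparse matrix wins over a 
plain list of running tasks. 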

I agree that a sparse matrix is the better option for speculative tasks, and 
that it may also be useful in the future for more complex scheduling decisions, 
as Arun points out. However, because it requires new code in such central/core 
functionality, I'd recommend the more cautious approach of using the 
tried-and-tested code you already have to solve most, if not all, of the 
problems you're facing today, and looking at a sparse matrix implementation if 
the need becomes great. New code always brings its own problems with testing 
and implementation, and potential side effects. 

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lags behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI keeps getting bigger and bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress is reported afterwards.
> The job tracker is running on a separate node.
> The job tracker process consumed 100% CPU, with a VM size of 1.01 GB (reaching 
> the heap space limit).
