[
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568621#action_12568621
]
Vivek Ratan commented on HADOOP-2119:
-------------------------------------
The only differences between 1a (we need to give that a name - let's call it the
'cache map', as it's mostly based on how the cache is implemented) and a
sparse matrix are that in the latter, each TIP is linked to the TIP below it in
the same column, and that the linked list for a row in a sparse matrix is doubly
linked (you need that to efficiently delete tasks from the running list), while
in 1a the runnable list is singly linked. Given that, I would vote for the
cache map, for the following reasons:
- My big concern with implementing a sparse matrix now is that you're
implementing a brand new data structure. Given how core this functionality is,
and the time constraints, it's riskier to introduce that kind of newness into
the code.
- You already have most of the code in place for the cache map; it's been there
for a while and has been tested in production. That gives me a lot more comfort
than putting in brand new code.
- In terms of performance, the only difference between a linked list for
running tasks (2a) and a sparse matrix is for speculative tasks, where the
latter performs better (see the sketch after this list). However, it's not
clear to me how much this will show up in overall performance. It seems like
the lower performance of a linked list may have an extremely minimal effect in
the overall scheme of things, so why throw in new code? It's better, IMO, to
see whether this performance difference is indeed significant before making
big changes.
- As I mentioned earlier, the cache map and sparse matrix are almost identical.
I don't see a sparse matrix being any simpler or more elegant than a cache map,
i.e., I see both as fairly simple and elegant structures.
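To make the comparison concrete, here's a rough sketch of the two structures in
Java. The names (Tip, CacheMap, SparseMatrixRow) are made up for illustration -
this is not the actual JobTracker code. The point it shows is that the doubly
linked row of the sparse matrix lets you unlink a running/speculative TIP in
O(1), while the singly linked runnable list of 1a needs a linear scan:

import java.util.HashMap;
import java.util.Map;

class Tip {
    final String id;
    // Row pointers (doubly linked) used by the sparse-matrix variant so that a
    // running or speculative TIP can be unlinked in O(1).
    Tip prevInRow, nextInRow;
    // Column pointer: the TIP below this one in the same column, as described
    // above for the sparse matrix.
    Tip belowInColumn;
    // Next pointer for the singly linked runnable list of option 1a.
    Tip nextRunnable;
    Tip(String id) { this.id = id; }
}

// Option 1a ("cache map"): node/rack -> singly linked list of runnable TIPs.
class CacheMap {
    private final Map<String, Tip> runnableHead = new HashMap<>();

    void addRunnable(String node, Tip tip) {
        tip.nextRunnable = runnableHead.get(node);
        runnableHead.put(node, tip);
    }

    // Removing an arbitrary TIP (e.g. one picked for speculation) needs a
    // linear scan, because the list is singly linked.
    void remove(String node, Tip tip) {
        Tip cur = runnableHead.get(node), prev = null;
        while (cur != null && cur != tip) { prev = cur; cur = cur.nextRunnable; }
        if (cur == null) return;                     // not in this list
        if (prev == null) runnableHead.put(node, cur.nextRunnable);
        else prev.nextRunnable = cur.nextRunnable;
        cur.nextRunnable = null;
    }
}

// Sparse-matrix row: doubly linked, so a TIP is unlinked in O(1) given only a
// reference to it - the advantage for running/speculative tasks.
class SparseMatrixRow {
    private Tip head;

    void add(Tip tip) {
        tip.nextInRow = head;
        tip.prevInRow = null;
        if (head != null) head.prevInRow = tip;
        head = tip;
    }

    void remove(Tip tip) {
        if (tip.prevInRow != null) tip.prevInRow.nextInRow = tip.nextInRow;
        else head = tip.nextInRow;
        if (tip.nextInRow != null) tip.nextInRow.prevInRow = tip.prevInRow;
        tip.prevInRow = tip.nextInRow = null;
    }
}

Either way the bookkeeping is tiny; the real question, as argued above, is
whether the O(n) scan for speculative removals actually matters in practice.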
I agree that a sparse matrix is the better option for speculative tasks, and
that it may also be useful in the future for more complex scheduling decisions,
as Arun points out. However, because it requires new code in such central/core
functionality, I'd recommend the more cautious approach of using the
tried-and-tested code you already have to solve most, if not all, of the
problems you're facing today, and looking at a sparse matrix implementation if
the need is great. New code always brings with it problems with testing and
implementation, and potential side effects.
> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
> Key: HADOOP-2119
> URL: https://issues.apache.org/jira/browse/HADOOP-2119
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Reporter: Runping Qi
> Assignee: Amar Kamat
> Priority: Critical
> Fix For: 0.17.0
>
> Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress was reported afterwards.
> The job tracker is running on a separate node.
> The job tracker process consumed 100% CPU, with VM size 1.01g (reaching the heap
> space limit).