[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Amar Kamat (JIRA) Fri, 21 Mar 2008 15:43:28 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581206#action_12581206
 ]


Amar Kamat commented on HADOOP-2119:
------------------------------------

I have taken Owen's comments into consideration. Here is what was done:
bq. I really wish that the synchronization changes could be done in another 
patch ...
+1. Removed all the synchronization changes. I will open a separate issue for 
them.
bq. siblingSetAtLevel seems really arcane. I would propose that instead you add 
getChildren to the ...
Maintaining this information at the Node level might involve more complexity 
and would require more testing. A concept of children already exists in 
NodeBase, but from the code it is not clear what they are for or how to use 
them. Now there is just a single set of nodes at {{maxlevel}}, maintained at 
the JobTracker. For now this seems to be the simpler solution.
bq. why is there yet another map from hostname to Node? This is already done in 
the node mapping.
This is done to reduce the penalty incurred during job execution. While the job 
is running, the only resolution cost is for datanodes and newly joining 
trackers; trackers that are present before the job is submitted are resolved as 
part of the heartbeat (in a separate thread). Without this mapping there is no 
way to find the Node for a given hostname. Also, I have renamed the variable 
_trackerNameToNodeMap_ that is in trunk, and I now use it to store the datanode 
mappings as well.
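A minimal sketch of the idea behind such a hostname-to-Node map, not the code from the patch. The names {{TopologyNode}}, {{HostToNodeCache}} and {{rackResolver}} are hypothetical and stand in for Hadoop's Node and its topology resolution; the point is only that each hostname is resolved once and later lookups are served from the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for a topology node; this is not Hadoop's Node interface.
final class TopologyNode {
    final String host;
    final String rack;

    TopologyNode(String host, String rack) {
        this.host = host;
        this.rack = rack;
    }
}

// Caches hostname -> node so the (possibly slow) rack resolver runs once per host.
final class HostToNodeCache {
    private final Map<String, TopologyNode> hostToNode = new ConcurrentHashMap<>();
    private final Function<String, String> rackResolver; // hostname -> rack path (assumed)
    private final AtomicInteger resolverCalls = new AtomicInteger();

    HostToNodeCache(Function<String, String> rackResolver) {
        this.rackResolver = rackResolver;
    }

    // Same entry point for tracker hosts and datanode hosts.
    TopologyNode resolve(String host) {
        return hostToNode.computeIfAbsent(host, h -> {
            resolverCalls.incrementAndGet();
            return new TopologyNode(h, rackResolver.apply(h));
        });
    }

    // Number of times the underlying resolver was actually invoked.
    int resolverCalls() {
        return resolverCalls.get();
    }
}
```

With this shape, a tracker resolved on its first heartbeat costs nothing when a job later asks for its node, and only hosts that have never been seen pay for resolution.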
bq.  I'm really concerned that we are adding 5 new fields holding collections 
to the JobInProgress
As I said, this is required to do away with the array, and the total space is 
roughly bounded by the total number of TIPs. A TIP is either local or 
non-local, and either running or non-running. Mostly the TIPs just move from 
one list to another. Hence 
_local-maps-non-running + local-maps-running + non-local-maps-non-running + 
non-local-maps-running ~ total-map-tips_
and
_non-running-reduces + running-reduces ~ total-reduce-tips_.
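The bound above can be illustrated with a small sketch, assuming each map TIP lives in exactly one of the four collections at a time. The class {{MapTipBook}} and its method names are hypothetical and are not the fields added to JobInProgress; TIPs are represented by plain string ids.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical bookkeeping for map TIPs: four sets, each TIP in exactly one of them.
final class MapTipBook {
    private final Set<String> localNonRunning = new LinkedHashSet<>();
    private final Set<String> localRunning = new LinkedHashSet<>();
    private final Set<String> nonLocalNonRunning = new LinkedHashSet<>();
    private final Set<String> nonLocalRunning = new LinkedHashSet<>();

    // A new TIP always starts in one of the two non-running sets.
    void add(String tipId, boolean local) {
        (local ? localNonRunning : nonLocalNonRunning).add(tipId);
    }

    // Moves a TIP from its non-running set to the matching running set.
    boolean markRunning(String tipId) {
        if (localNonRunning.remove(tipId)) {
            return localRunning.add(tipId);
        }
        if (nonLocalNonRunning.remove(tipId)) {
            return nonLocalRunning.add(tipId);
        }
        return false; // unknown TIP or already running
    }

    // Moves a TIP back (e.g. after a failed attempt) to the matching non-running set.
    boolean markNotRunning(String tipId) {
        if (localRunning.remove(tipId)) {
            return localNonRunning.add(tipId);
        }
        if (nonLocalRunning.remove(tipId)) {
            return nonLocalNonRunning.add(tipId);
        }
        return false; // unknown TIP or not running
    }

    // Total entries across all four sets; moves never change this number.
    int tracked() {
        return localNonRunning.size() + localRunning.size()
                + nonLocalNonRunning.size() + nonLocalRunning.size();
    }
}
```

Because every transition is a remove from one set followed by an add to another, the combined size stays equal to the number of TIPs added, which is the sense in which four collections cost about as much as one.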
bq. reducers is a really bad name.
Fixed.
bq. nodesToMaps should be runnableMaps
runnable means !failed && not-completed. Running and non-running both belong to 
the runnable category. But I have used a different name for this variable.
bq. Don't use assignment in a parameter to a method in initTasks
Fixed.
bq. I'm bothered by all of the checks for null Nodes that just skip the 
location.
Fixed. Now there are no null checks.
bq. Shouldn't we remove the node from the nodesToMaps regardless of the level?
Consider a case where _tip1_ fails on _host1_. _host1_ belongs to _rack1_. Now 
_host1_ runs out of cached tips and queries _rack1_'s cache. In such a case it 
should not remove the tip since some other tracker in the same rack can 
schedule it.
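The scenario above can be sketched as follows. This is an illustration under assumptions, not the patch's nodesToMaps code: {{LevelCache}}, {{findTip}} and the two-level host/rack lookup are hypothetical names, and it is an assumption here that a TIP which failed on a host is dropped from that host's own cache (level 0) but left in the rack's cache.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical per-node TIP cache, keyed by node name (a host or a rack).
final class LevelCache {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Map<String, Set<String>> failedOn = new HashMap<>(); // tip -> hosts it failed on

    void cacheTip(String node, String tip) {
        cache.computeIfAbsent(node, k -> new ArrayList<>()).add(tip);
    }

    void recordFailure(String tip, String host) {
        failedOn.computeIfAbsent(tip, k -> new HashSet<>()).add(host);
    }

    // Searches the host's cache (level 0), then its rack's cache (level 1).
    // Returns a TIP the host may run, or null if none is found.
    String findTip(String host, String rack) {
        String[] levels = { host, rack };
        for (int level = 0; level < levels.length; level++) {
            List<String> tips = cache.getOrDefault(levels[level], Collections.emptyList());
            Iterator<String> it = tips.iterator();
            while (it.hasNext()) {
                String tip = it.next();
                boolean failedHere =
                        failedOn.getOrDefault(tip, Collections.emptySet()).contains(host);
                if (!failedHere) {
                    return tip;
                }
                if (level == 0) {
                    it.remove(); // the host-level entry is of no use to this host
                }
                // At rack level the entry stays: another tracker in the rack may run it.
            }
        }
        return null;
    }

    // Snapshot of the TIPs currently cached for a node.
    List<String> cached(String node) {
        return new ArrayList<>(cache.getOrDefault(node, Collections.emptyList()));
    }
}
```

For example, if _tip1_ is cached under both _host1_ and _rack1_ and has failed on _host1_, a lookup for _host1_ returns nothing and leaves _tip1_ in _rack1_'s cache, so a later lookup for another host in _rack1_ can still pick it up.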
bq. nodesToMaps being null should be a fatal error
Done. In case of misconfiguration (i.e. nodesToMaps = null) the JobTracker will 
raise a fatal error and shut down.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: HADOOP-2119-v4.1.patch, HADOOP-2119-v5.1.patch, 
> HADOOP-2119-v5.1.patch, HADOOP-2119-v5.2.patch, hadoop-2119.patch, 
> hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The job tracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress was reported afterwards.
> The job tracker was running on a separate node.
> The job tracker process consumed 100% CPU, with a VM size of 1.01g (reaching 
> the heap space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
