[ https://issues.apache.org/jira/browse/HADOOP-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1874:
--------------------------------

    Attachment: lazy-dfs-ops.patch

Attached is an early version of the patch. One thing we noticed was that the 
JobTracker's DFS operations, such as saveOutput and discardOutput, were taking 
too long on the large cluster. The unfortunate part is that the JobTracker 
stays locked while the DFS operation is in progress, which can result in lost 
trackers, since the JobTracker cannot process any RPC that requires taking a 
lock on itself. This patch moves the DFS ops out to a separate thread. It also 
introduces a new task state, DFS_OPS_PENDING, which a task enters after it 
completes and before the JobTracker performs the DFS ops. This state helps 
prevent anomalies such as marking a task successful before the DFS operation 
(rename) is done: a task is considered successful only when the DFS rename 
completes successfully. The state is also used to avoid launching speculative 
tasks, since we know the task is in its last stage. A minimal sketch of the 
pattern is included below.
This patch is up for review. It has not been well tested yet.
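
For reviewers, here is a minimal, self-contained sketch of the pattern the 
patch follows (queueing the blocking DFS renames for a worker thread, and 
gating success on the rename). It is not the patch code: the class and method 
names (DfsOpsOffloadSketch, PendingOp, renameOutput, and so on) are 
hypothetical, and only DFS_OPS_PENDING is a name taken from the patch, which 
does this work inside the JobTracker/TaskInProgress code.

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class DfsOpsOffloadSketch {

    enum TaskState { RUNNING, DFS_OPS_PENDING, SUCCEEDED, FAILED }

    /** A completed task whose output still has to be renamed into place. */
    static class PendingOp {
        final String taskId;
        PendingOp(String taskId) { this.taskId = taskId; }

        /** Stand-in for the blocking DFS rename (saveOutput) done by the real code. */
        boolean renameOutput() {
            // In the real system this is a DFS rename of the task's temporary
            // output directory into the job's final output directory.
            return true;
        }
    }

    private final BlockingQueue<PendingOp> pendingOps = new LinkedBlockingQueue<>();
    private final Map<String, TaskState> taskStates = new ConcurrentHashMap<>();

    /** Called when a task attempt is launched. */
    public synchronized void taskStarted(String taskId) {
        taskStates.put(taskId, TaskState.RUNNING);
    }

    /** Called (under the tracker lock) when a task reports completion. Instead of
     *  doing the DFS rename inline, the task is only marked DFS_OPS_PENDING. */
    public synchronized void taskReportedDone(String taskId) {
        taskStates.put(taskId, TaskState.DFS_OPS_PENDING);
        pendingOps.add(new PendingOp(taskId));
        // Returns immediately; the lock is never held across a DFS call.
    }

    /** Worker thread that performs the renames outside the tracker lock. */
    public void startDfsOpsThread() {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    PendingOp op = pendingOps.take();   // blocks, no lock held
                    boolean ok = op.renameOutput();     // slow DFS call
                    synchronized (this) {               // brief lock to update state
                        taskStates.put(op.taskId,
                                ok ? TaskState.SUCCEEDED : TaskState.FAILED);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "dfs-ops");
        t.setDaemon(true);
        t.start();
    }

    /** Speculative attempts are not launched for tasks already in DFS_OPS_PENDING. */
    public synchronized boolean eligibleForSpeculation(String taskId) {
        return taskStates.get(taskId) == TaskState.RUNNING;
    }
}

The point of the sketch is that taskReportedDone never blocks on DFS while 
holding the tracker lock; the only lock taken on the dfs-ops thread is the 
brief one needed to flip the task's state once the rename has finished.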

> lost task trackers -- jobs hang
> -------------------------------
>
>                 Key: HADOOP-1874
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1874
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: lazy-dfs-ops.patch
>
>
> This happens on a 1400 node cluster using a recent nightly build patched with 
> HADOOP-1763 (that fixes a previous 'lost task tracker' issue) running a 
> c++-pipes job with 4200 maps and 2800 reduces. The task trackers start to get 
> lost in large numbers as the job nears completion.
> Similar non-pipes jobs do not show the same problem, but it is unclear whether 
> the issue is related to c++-pipes. It could also be dfs overload when reduce 
> tasks close and validate all newly created dfs files. I see dfs client rpc 
> timeout exceptions, but that alone does not explain the escalation in lost 
> task trackers.
> I also noticed that the JobTracker becomes rather unresponsive, with rpc 
> timeout and call queue overflow exceptions. The JobTracker is running with 60 
> handlers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
