[ 
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amar Kamat updated HADOOP-3245:
-------------------------------

    Attachment: HADOOP-3245-v5.14.patch

Had an offline discussion with Devaraj and Hemanth. Attaching a patch that 
incorporates their comments, which are as follows:
1) Avoid safe mode by delaying the start of ipc server. The ipc server starts 
only after the JobTracker has recovered. This avoids any extra coding at the 
tasktracker side.
2) Avoid having any registration window, as TrackerExpiryThread will take care 
of the trackers that were lost while the JobTracker was down. 
3) Remove unnecessary changes done to the Task/TaskAttempt logging with respect 
to passing of counters
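
As an illustration of point 1, here is a minimal sketch of the startup ordering, assuming a simplified stand-in for the JobTracker. The class and method names (DelayedIpcStartup, recover, startIpcServer) are hypothetical and are not taken from the patch; the only thing shown is that the ipc endpoint is opened strictly after recovery finishes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the JobTracker startup sequence; not Hadoop code.
class DelayedIpcStartup {
    private final List<String> events = new ArrayList<>();
    private boolean recovered = false;
    private boolean ipcRunning = false;

    // Simulates replaying persisted job state. Nothing can connect yet,
    // because the ipc endpoint has not been opened.
    void recover() {
        events.add("recover");
        recovered = true;
    }

    // Opens the (simulated) ipc endpoint, but only once recovery has finished.
    void startIpcServer() {
        if (!recovered) {
            throw new IllegalStateException(
                "ipc server must not start before recovery completes");
        }
        events.add("ipc-start");
        ipcRunning = true;
    }

    // Startup order described in point 1: recover first, then accept connections.
    void start() {
        recover();
        startIpcServer();
    }

    boolean isIpcRunning() {
        return ipcRunning;
    }

    List<String> events() {
        return events;
    }
}
```

With this ordering a tasktracker simply fails to connect until recovery is done, so it needs no knowledge of a safe mode.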
----
Things taken care of from the todo list
1) Refactored the recovery-related code into RecoveryManager
----
Things that need more work/discussion
1) Is safe mode required? The question we need to answer is whether we want to 
start the ipc server early. Starting it early will allow JobClient and 
TaskTracker to connect to the JobTracker. It is then the JobTracker's 
responsibility to handle the connection: it could either throw an exception or 
reply with a _dummy_ response. Apart from letting the JobClient detect that 
the JT is up but under maintenance and take some specific action, there seems 
to be no reason to have the ipc services running before recovery (i.e. to have 
the safe mode).
2) W.r.t point #7 in my earlier comment 
([here|https://issues.apache.org/jira/browse/HADOOP-3245?focusedCommentId=12620042#action_12620042])
 it seems that the time to detect the previously killed tasks will depend on 
2.1) the number of reducers
2.2) the reducers' ability to report back the fetch failures
It seems we can do better by asking each tracker for the list of maps that 
it currently hosts, i.e. the tasks on that tracker that completed 
successfully. The JobTracker can then kill all the tasks that were not 
claimed. We feel this can be dealt with in a separate issue. 
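
The idea in point 2 can be sketched as a set difference. This is only an illustration under assumptions: the class MapClaimReconciler, its method, and the string attempt ids are hypothetical, and no such tracker query exists in the patch.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reconciliation helper; names are illustrative, not Hadoop APIs.
class MapClaimReconciler {
    // successfulMaps:  map attempts the restarted JobTracker believes succeeded.
    // claimsByTracker: for each tracker, the successful maps it reports hosting.
    // Returns the attempts that no tracker claimed; these are the candidates
    // the JobTracker would kill.
    static Set<String> unclaimed(Set<String> successfulMaps,
                                 Map<String, Set<String>> claimsByTracker) {
        Set<String> result = new TreeSet<>(successfulMaps);
        for (Set<String> claimed : claimsByTracker.values()) {
            result.removeAll(claimed);
        }
        return result;
    }
}
```

The detection time then depends on how quickly the trackers report their lists rather than on the reducers hitting fetch failures.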
---- 
I am currently testing the patch on a larger cluster.

> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch, 
> HADOOP-3245-v2.6.9.patch, HADOOP-3245-v4.1.patch, HADOOP-3245-v5.13.patch, 
> HADOOP-3245-v5.14.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be 
> applied for things like jobs being able to survive jobtracker restarts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
