[
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amar Kamat updated HADOOP-3245:
-------------------------------
Attachment: HADOOP-3245-v5.14.patch
Had an offline discussion with Devaraj and Hemanth. Attaching a patch that
incorporates their comments, which are as follows:
1) Avoid safe mode by delaying the start of ipc server. The ipc server starts
only after the JobTracker has recovered. This avoids any extra coding at the
tasktracker side.
2) Avoid having any registration window, as TrackerExpiryThread will take care
of the trackers that were lost while the JobTracker was down.
3) Remove unnecessary changes done to the Task/TaskAttempt logging with respect
to passing of counters
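
A minimal sketch of the ordering described in point 1, assuming recovery and the
ipc server start are two separate steps run in sequence. All class and method
names below (DelayedIpcStartSketch, recoverJobs, startIpcServer) are
hypothetical stand-ins and are not taken from the patch or from the real
JobTracker code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the startup ordering; none of these names come
// from the patch or from Hadoop's actual JobTracker.
final class DelayedIpcStartSketch {
    // Records what happened, in order, so the ordering can be inspected.
    final List<String> events = new ArrayList<>();
    private boolean ipcStarted = false;

    // Stand-in for replaying the persisted job state.
    private void recoverJobs() {
        events.add("recovered");
    }

    // Stand-in for opening the RPC endpoint used by trackers and clients.
    private void startIpcServer() {
        ipcStarted = true;
        events.add("ipc-started");
    }

    // Recovery completes first; only afterwards can anyone connect, so no
    // caller ever observes a half-recovered JobTracker.
    void start() {
        recoverJobs();
        startIpcServer();
    }

    boolean acceptsConnections() {
        return ipcStarted;
    }

    public static void main(String[] args) {
        DelayedIpcStartSketch jt = new DelayedIpcStartSketch();
        System.out.println("before start: " + jt.acceptsConnections());
        jt.start();
        System.out.println("events: " + jt.events);
    }
}
```

With this ordering a tasktracker simply fails to connect until recovery is done,
which is why no safe-mode handling should be needed on the tasktracker side.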
----
Things taken care of from the todo list
1) Refactored the code related to recovery into RecoveryManager
----
Things that need more work/discussion
1) Is safe mode required? Whether we want to start the ipc server early is the
question we need to answer. Starting it early will allow JobClient and
TaskTracker to connect to the JobTracker. It's the JobTracker's responsibility
to handle the connection. It could either throw an exception or reply
with a _dummy_ response. Apart from the fact that the JobClient can now detect
that the JT is up but under maintenance and take some specific actions, there
seems to be no reason to have the ipc services running before recovery (i.e. to
have the safe mode).
2) W.r.t. point #7 in my earlier comment
([here|https://issues.apache.org/jira/browse/HADOOP-3245?focusedCommentId=12620042#action_12620042])
it seems that the time to detect the previously killed tasks will depend on
2.1) the number of reducers
2.2) the reducers' ability to report back the fetch failures
It seems we can do better by asking each tracker for the list of maps that it
currently hosts, i.e. the list of successful tasks on that tracker. The
JobTracker can then kill all the tasks that were not claimed. We feel this can
be dealt with in a separate issue.
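
The idea in point 2 amounts to a set difference: the maps the JobTracker
believes are complete, minus everything some tracker claims to host. A rough
sketch follows, assuming tasks are identified by plain strings; the class name,
the method findUnclaimed and the ids are all hypothetical, and nothing here
reflects an existing JobTracker API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not part of the patch or of Hadoop's JobTracker.
final class UnclaimedTaskSketch {
    // completedMaps: ids of maps the JobTracker recovered as successful.
    // claimedByTracker: tracker name -> ids that tracker reports hosting.
    // Returns the recovered maps that no tracker claimed, i.e. the
    // candidates to be killed and rescheduled.
    static Set<String> findUnclaimed(
            Set<String> completedMaps,
            Map<String, ? extends Collection<String>> claimedByTracker) {
        Set<String> unclaimed = new TreeSet<>(completedMaps);
        for (Collection<String> claimed : claimedByTracker.values()) {
            unclaimed.removeAll(claimed);
        }
        return unclaimed;
    }

    public static void main(String[] args) {
        // Made-up ids, for illustration only.
        Set<String> completed =
            new HashSet<>(Arrays.asList("m_0", "m_1", "m_2", "m_3"));
        Map<String, List<String>> claimed = new HashMap<>();
        claimed.put("trackerA", Arrays.asList("m_0"));
        claimed.put("trackerB", Arrays.asList("m_2"));
        System.out.println(findUnclaimed(completed, claimed));
    }
}
```

Unlike waiting for fetch failures, this check would not depend on the number of
reducers, only on every live tracker having reported once.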
----
I am currently testing the patch on a larger cluster.
> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
> Key: HADOOP-3245
> URL: https://issues.apache.org/jira/browse/HADOOP-3245
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Amar Kamat
> Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch,
> HADOOP-3245-v2.6.9.patch, HADOOP-3245-v4.1.patch, HADOOP-3245-v5.13.patch,
> HADOOP-3245-v5.14.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be
> applied for things like jobs being able to survive jobtracker restarts.