[
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amar Kamat updated HADOOP-3245:
-------------------------------
Attachment: HADOOP-3245-v5.13.patch
Attaching a patch that implements JT restart using JobHistory.
_Changes :_
Currently the job history filename is of the following format
history-timestamp_jt-hostname_jobid_username_jobname. It was introduced in
HADOOP-239 and the timestamp was added in the beginning since the job names
were not unique. It makes it difficult to guess the job history filename with
history-timestamp. So history-timestamp is removed as currently job-id is
unique across restarts.
So for now we define
master-file = jt-hostname_jobid_username_jobname.
tmp-file = master-file.tmp
_Working :_
0) Upon restart the JT goes in _safe_ mode. In safe mode all the trackers are
asked to resend/replay their heartbeat.
1) For a new job, the history file is the _master-file_. For a restarted job,
the history is written to the _tmp_ file.
2) Following checks are made for a recovered job
2.1) If the master file exists then delete the tmp file
2.2) If the master file is missing then make the tmp file as master
3) Upon restart the master-file is read and default-history-parser is used to
parse and recover history records. These records are used to create taskStatus
which is replayed in order. Before replaying the JT waits for the jobs to be
inited.
4) Once the replay is over, delete the _master-file_ to indicate that the _tmp_
file is more recent. Note that on next restart the _tmp_ file will be used for
recovery.
5) Once all the jobs are recovered, turn off the safe mode. JT will now process
heartbeats (called as successful re-connect). Also the registration window
timer starts. JT waits for _tracker-expiry-interval_ time after
_last-tracker-re-connect_ before closing the window. Once the window closes, JT
is considered as _recovered_. This plays an important role in detecting the
trackers that went down while the JT was down. Upon _recovery_, JT re-executes
all the tasks that were on the lost trackers.
6) Since the history can have some data missing, there can be a case where the
_map-completion-event-list_ at the JT is smaller than the one at the tracker.
Hence there is a rollback required upon restart. Once the JT is out of safe
mode, it passes this information (_map-events-list-size_) to the tracker on the
successful reconnect.
7) The tasktracker rollbacks few events and asks the child tasks to reset their
index to 0. Child tasks fetches all the events back and filters out necessary
events for further processing. This is similar to the one discussed in approach
#1.
8) Errors in history can cause the parser to fail. We have HADOOP-2403 to
address this. For now this patch encodes errors. This will replaced with the
fix in HADOOP-2403.
9) Currently counters are stringified and written to history. It is not
possible to recover the counter back from the string and hence this patch
encodes the counter-names so that they can be easily recovered. Note that there
is no encoding in the user space. Only the frameworks history file has codes.
10) Once the job finishes the _tmp_ file is renamed to _master-file_. Similarly
the history files in the user directory also follow the same renaming cycle.
11) Job priority is logged on every change and hence its recovered.
_Issues :_
1) This approach/patch works fine with history on local fs. With history on
HDFS, the history file becomes visible but not available (i.e file-size = 0).
The file becomes available only on close(). Sync() documentation indicates that
the file-data availability is not guaranteed.
2) Detecting job runtime is still an issue.
We are working on it.
_Todo :_
1) Refactor common code.
2) Remove extra logs
3) For ease of testing JT killing facility is added to web-ui. There is some
extra code to support this. Clear it out.
4) To test the usage of {{sync()}}, there are periodic syncs done to the
history files. This is just for testing.
5) Optimize encoding/decoding.
6) Group together all the recovery code under something like
{{JobTrackerRecoveryManager}}.
----
Note that the logs/debugging-code/testing-code is still a part of this patch as
I am testing it.
> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
> Key: HADOOP-3245
> URL: https://issues.apache.org/jira/browse/HADOOP-3245
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Amar Kamat
> Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch,
> HADOOP-3245-v2.6.9.patch, HADOOP-3245-v4.1.patch, HADOOP-3245-v5.13.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be
> applied for things like jobs being able to survive jobtracker restarts.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.