[ 
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amar Kamat updated HADOOP-3245:
-------------------------------

    Attachment: HADOOP-3245-v5.13.patch

Attaching a patch that implements JT restart using JobHistory.

_Changes :_
Currently the job history filename has the following format:
history-timestamp_jt-hostname_jobid_username_jobname. This format was 
introduced in HADOOP-239, and the timestamp was placed at the beginning because 
job names were not unique. The history-timestamp prefix makes it difficult to 
guess the job history filename, so it is removed; the job-id is now unique 
across restarts.
So for now we define 
master-file = jt-hostname_jobid_username_jobname.
tmp-file = master-file.tmp
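
As a rough illustration of the naming scheme above, the two filenames could be 
derived as follows. This is only a sketch: the class {{HistoryFileNames}} and 
its method names are hypothetical and are not taken from the patch.

```java
// Hypothetical helper; the class and method names are illustrative only
// and do not come from the attached patch.
final class HistoryFileNames {
    private HistoryFileNames() {}

    // master-file = jt-hostname_jobid_username_jobname
    static String masterFile(String jtHostname, String jobId,
                             String userName, String jobName) {
        return String.join("_", jtHostname, jobId, userName, jobName);
    }

    // tmp-file = master-file.tmp
    static String tmpFile(String masterFile) {
        return masterFile + ".tmp";
    }
}
```

Because no timestamp is involved, the name can be computed from the job-id and 
the other job attributes alone, which is what makes the file easy to locate on 
restart.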

_Working :_
0) Upon restart the JT enters _safe_ mode. In safe mode all the trackers are 
asked to resend/replay their heartbeat. 

1) For a new job, the history file is the _master-file_. For a restarted job, 
the history is written to the _tmp_ file.

2) The following checks are made for a recovered job:
  2.1) If the master file exists, delete the tmp file.
  2.2) If the master file is missing, make the tmp file the master.
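
The checks in step 2 can be sketched as below. This is a minimal illustration 
that assumes history on the local filesystem and uses {{java.nio.file}} rather 
than the Hadoop {{FileSystem}} API; {{HistoryFileRecovery}} and 
{{resolveMaster}} are hypothetical names, not code from the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the master/tmp check for a recovered job.
final class HistoryFileRecovery {
    private HistoryFileRecovery() {}

    // Returns the path that should be parsed for recovery.
    static Path resolveMaster(Path master) throws IOException {
        Path tmp = master.resolveSibling(master.getFileName() + ".tmp");
        if (Files.exists(master)) {
            // 2.1) The master file exists, so the tmp file is discarded.
            Files.deleteIfExists(tmp);
        } else if (Files.exists(tmp)) {
            // 2.2) The master file is missing, so the tmp file becomes the master.
            Files.move(tmp, master);
        }
        return master;
    }
}
```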

3) Upon restart the master-file is read and the default-history-parser is used 
to parse and recover history records. These records are used to create 
taskStatus objects, which are replayed in order. Before replaying, the JT waits 
for the jobs to be initialized.

4) Once the replay is over, delete the _master-file_ to indicate that the _tmp_ 
file is more recent. Note that on next restart the _tmp_ file will be used for 
recovery.

5) Once all the jobs are recovered, safe mode is turned off. The JT now 
processes heartbeats (referred to as a successful re-connect) and the 
registration window timer starts. The JT waits for _tracker-expiry-interval_ 
after the _last-tracker-re-connect_ before closing the window. Once the window 
closes, the JT is considered _recovered_. This window plays an important role 
in detecting the trackers that went down while the JT was down. Upon 
_recovery_, the JT re-executes all the tasks that were on the lost trackers. 
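
The registration window described in step 5 can be modelled roughly as follows. 
This is a sketch under assumptions: {{RegistrationWindow}} and its methods are 
hypothetical names, times are plain millisecond values supplied by the caller, 
and a tracker is treated as lost if it appears in the recovered history but did 
not re-connect before the window closed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the registration window; not code from the patch.
final class RegistrationWindow {
    private final long expiryIntervalMs;   // tracker-expiry-interval
    private long lastReconnectMs;          // last-tracker-re-connect
    private final Set<String> reconnected = new HashSet<>();

    RegistrationWindow(long expiryIntervalMs, long openedAtMs) {
        this.expiryIntervalMs = expiryIntervalMs;
        this.lastReconnectMs = openedAtMs;
    }

    // Called for every successful re-connect; this extends the window.
    void trackerReconnected(String trackerName, long nowMs) {
        reconnected.add(trackerName);
        lastReconnectMs = Math.max(lastReconnectMs, nowMs);
    }

    // The window closes once no tracker has re-connected for a full interval.
    boolean isClosed(long nowMs) {
        return nowMs - lastReconnectMs >= expiryIntervalMs;
    }

    // Trackers known from history that never re-connected (assumed lost).
    Set<String> lostTrackers(Set<String> trackersSeenInHistory) {
        Set<String> lost = new HashSet<>(trackersSeenInHistory);
        lost.removeAll(reconnected);
        return lost;
    }
}
```

Once {{isClosed}} returns true, the tasks that ran on the trackers returned by 
{{lostTrackers}} would be the ones to re-execute.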

6) Since the history can have some data missing, there can be a case where the 
_map-completion-event-list_ at the JT is smaller than the one at the tracker. 
Hence there is a rollback required upon restart. Once the JT is out of safe 
mode, it passes this information (_map-events-list-size_) to the tracker on the 
successful reconnect.

7) The tasktracker rolls back a few events and asks the child tasks to reset 
their index to 0. The child tasks fetch all the events again and pick out the 
events needed for further processing. This is similar to what was discussed in 
approach #1.
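
Steps 6 and 7 can be illustrated with the sketch below. It is an assumption-laden 
example, not the patch's code: {{MapEventRollback}} is a hypothetical class, 
events are represented as plain strings, and "events needed for further 
processing" is interpreted here as events the child has not already processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the map-completion-event rollback.
final class MapEventRollback {
    private MapEventRollback() {}

    // Tasktracker side: drop the events beyond the list size reported by the JT
    // (map-events-list-size) on the successful re-connect.
    static <E> void rollbackTo(List<E> trackerEvents, int jtEventListSize) {
        int keep = Math.max(0, jtEventListSize);
        if (keep < trackerEvents.size()) {
            trackerEvents.subList(keep, trackerEvents.size()).clear();
        }
    }

    // Child side: after resetting its index to 0 the child sees all events
    // again and keeps only those it has not processed yet.
    static List<String> refetch(List<String> allEvents, Set<String> alreadyProcessed) {
        List<String> fresh = new ArrayList<>();
        for (String event : allEvents) {
            if (alreadyProcessed.add(event)) {
                fresh.add(event);
            }
        }
        return fresh;
    }
}
```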

8) Errors in history can cause the parser to fail. We have HADOOP-2403 to 
address this. For now this patch encodes errors. This will be replaced with the 
fix in HADOOP-2403.

9) Currently counters are stringified and written to history. It is not 
possible to recover a counter from the string, so this patch encodes the 
counter names so that they can be easily recovered. Note that there is no 
encoding in the user space. Only the framework's history file has codes.

10) Once the job finishes, the _tmp_ file is renamed to _master-file_. The 
history files in the user directory follow the same renaming cycle.

11) Job priority is logged on every change and hence it is recovered.

_Issues :_
1) This approach/patch works fine with history on the local fs. With history on 
HDFS, the history file becomes visible but not available (i.e. file-size = 0). 
The file becomes available only on close(). The {{sync()}} documentation 
indicates that file-data availability is not guaranteed. 

2) Detecting job runtime is still an issue. 

We are working on it.

_Todo :_
1) Refactor common code.
2) Remove extra logs.
3) For ease of testing, a JT killing facility is added to the web-ui. There is 
some extra code to support this. Clear it out.
4) To test the usage of {{sync()}}, there are periodic syncs done to the 
history files. This is just for testing.
5) Optimize encoding/decoding.
6) Group together all the recovery code under something like 
{{JobTrackerRecoveryManager}}.
----
Note that the logs/debugging-code/testing-code are still part of this patch as 
I am testing it.

> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch, 
> HADOOP-3245-v2.6.9.patch, HADOOP-3245-v4.1.patch, HADOOP-3245-v5.13.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be 
> applied for things like jobs being able to survive jobtracker restarts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.