[ 
https://issues.apache.org/jira/browse/HADOOP-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676294#action_12676294
 ] 

Amar Kamat commented on HADOOP-5319:
------------------------------------

As of now there are 2 ways to achieve this:
# Create a file for every tracker in _system-dir_. Upon a (re)join, simply 
create the file, and upon a lost tracker, simply delete the file (as the first 
step, before changing the jobtracker's data structures). So, upon restart, there 
will be 2 lists, one obtained from the history and one from the _system-dir_. 
Mark as lost every tracker whose file is missing from the _system-dir_. These 
trackers should be marked lost before the jobtracker starts accepting trackers. 
Note that the file creation is an in-memory operation (on the namenode) and will 
be part of the heartbeat (once per tracker [re]join), while the deletes will 
happen in a separate thread.
Benefits : No sync required, i.e. faster per-tracker update
Drawbacks : Too many files 
 
# Maintain, in a file, the list of trackers currently available with the 
jobtracker. Assume that the jobtracker waits for X (say 
mapred.tasktracker.expiry.interval) units of time before creating the first 
file. After that, on every tracker update (join/delete) the tracker-file is 
updated (a new file is written and then renamed to the old filename). Note that 
this file operation is not just in memory and requires writing the file. Hence 
this solution requires that the writes be batched up and not done inline with 
the heartbeat (i.e. a separate thread does the writing). 
Benefits : Fewer files
Drawbacks : Every update requires writing the current list of trackers to the 
file and replacing the old file with the new one.
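
The write-then-rename step of the second option can be sketched as follows. 
This is only an illustration in Python against the local filesystem, not 
jobtracker code: the class name TrackerListFile, its methods and the ".tmp" 
suffix are all hypothetical, and whether a rename can replace an existing file 
atomically depends on the filesystem actually used for _system-dir_.

```python
import os
import tempfile


class TrackerListFile:
    """Hypothetical helper: keeps the live-tracker list in one file."""

    def __init__(self, path):
        self.path = path
        self.trackers = set()
        self.dirty = False  # True when memory is ahead of the file

    def on_join(self, tracker):
        # Called inline with the heartbeat: memory only, no file I/O.
        self.trackers.add(tracker)
        self.dirty = True

    def on_lost(self, tracker):
        # Also memory only; the file is brought up to date by flush().
        self.trackers.discard(tracker)
        self.dirty = True

    def flush(self):
        """Meant to be called from a background thread; batches updates."""
        if not self.dirty:
            return False
        tmp = self.path + ".tmp"  # assumed temporary name
        with open(tmp, "w") as f:
            f.write("\n".join(sorted(self.trackers)))
            f.flush()
            os.fsync(f.fileno())
        # Swap the new file in under the old name.
        os.replace(tmp, self.path)
        self.dirty = False
        return True

    @staticmethod
    def load(path):
        # Read back the tracker list, e.g. after a restart.
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
```

Several joins and losses between two flush() calls cost a single file write, 
which is the batching described above.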

Both these solutions (should) guarantee that the trackers that were lost in the 
old jobtracker are also lost in the new jobtracker. Thoughts?
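
For the per-tracker-file option, the restart check could look roughly like the 
sketch below. It is a Python illustration using a local directory as a 
stand-in for _system-dir_. The class TrackerMarkers, the "trackers" 
subdirectory and the method names are hypothetical, and tracker names are 
assumed to be usable as file names.

```python
import os
import tempfile


class TrackerMarkers:
    """Hypothetical helper: one empty marker file per live tracker."""

    def __init__(self, system_dir):
        self.dir = os.path.join(system_dir, "trackers")  # assumed layout
        os.makedirs(self.dir, exist_ok=True)

    def on_join(self, tracker):
        # Create (or re-create) the marker; the file carries no data.
        open(os.path.join(self.dir, tracker), "w").close()

    def on_lost(self, tracker):
        # Delete the marker first, before touching in-memory state.
        try:
            os.remove(os.path.join(self.dir, tracker))
        except FileNotFoundError:
            pass  # already gone, nothing to do

    def lost_on_restart(self, trackers_from_history):
        # Trackers known from the history but without a marker file
        # were lost under the old jobtracker.
        alive = set(os.listdir(self.dir))
        return set(trackers_from_history) - alive
```

On restart, the set returned by lost_on_restart() would be marked lost before 
any tracker is allowed to join.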

> Inconsistency in handling lost trackers upon jobtracker restart
> ---------------------------------------------------------------
>
>                 Key: HADOOP-5319
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5319
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> If a tasktracker is lost, the jobtracker kills all the tasks that were 
> successful on that tracker and re-executes them somewhere else. In-memory 
> data structures are all cleared up for the lost tracker. Now if the jobtracker 
> restarts, the new jobtracker has no clue about the trackers that were lost, 
> and hence if the lost trackers join back, they will be accepted and all the 
> tasks on those trackers will join back. Following are the issues:
> - If the running task on the lost tracker is killed, its cleanup attempt will 
> be launched. Now the new jobtracker has no idea about this attempt. Also, the 
> lost tracker can join back, and hence there are 2 attempts running 
> with the same id, one which can move the tip to success and the other which 
> moves the tip to the killed state.
> - Ideally, the lost tracker should be asked to re-init, which won't happen now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
