[
https://issues.apache.org/jira/browse/HADOOP-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676294#action_12676294
]
Amar Kamat commented on HADOOP-5319:
------------------------------------
As of now there are 2 ways to achieve this:
# Create a file for every tracker in _system-dir_. Upon a (re)join, simply
create the file, and upon a lost tracker, simply delete the file (as the first
step, before changing the jobtracker's data structures). So, upon restart, there
will be 2 lists, one obtained from the history and one from the _system-dir_.
Mark as lost every tracker whose file is missing from the _system-dir_. These
trackers should be lost before the jobtracker is opened up to the trackers. Note
that the file creation happens in memory (of the namenode) and will be part of
the heartbeat (once for every tracker [re]join), while the deletes will happen
in a separate thread.
Benefits : No sync required, i.e. faster per-tracker updates
Drawbacks : Too many files
# Maintain, in a single file, the list of trackers currently available to the
jobtracker. Assume that the jobtracker waits for X (say
mapred.tasktracker.expiry.interval) units of time before creating the first
file. After that, on every tracker update (join/delete) the tracker-file is
updated (a new file is written and then renamed to the old filename). Note that
this file operation is not just in memory and requires writing the file. Hence
in this solution the file access has to be batched up and kept out of line with
the heartbeat (i.e. done by a separate thread that does the writing).
Benefits : Fewer files
Drawbacks : Every update requires writing the current list of trackers to the
file and replacing the old file with the new one.
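A minimal sketch of the second option, to make the batching and the
write-then-rename step concrete. It assumes the local filesystem (java.nio.file)
as a stand-in for the _system-dir_, and the class and method names
(TrackerListFile, trackerJoined, trackerLost, flush, load) are hypothetical, not
existing jobtracker code. Whether the rename is atomic depends on the
underlying filesystem.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not Hadoop code); local files stand in for the system-dir.
class TrackerListFile {
    private final Path listFile;
    private final Set<String> trackers = new TreeSet<>();
    private boolean dirty = false;

    TrackerListFile(Path listFile) {
        this.listFile = listFile;
    }

    // Called inline with the heartbeat: memory-only, no file access.
    synchronized void trackerJoined(String name) {
        dirty |= trackers.add(name);
    }

    synchronized void trackerLost(String name) {
        dirty |= trackers.remove(name);
    }

    // Called from a background thread: all updates since the last flush
    // are batched into a single write of the full list.
    void flush() throws IOException {
        Set<String> snapshot;
        synchronized (this) {
            if (!dirty) {
                return;
            }
            snapshot = new TreeSet<>(trackers);
            dirty = false;
        }
        Path tmp = listFile.resolveSibling(listFile.getFileName() + ".tmp");
        try {
            // Write a new file, then rename it over the old filename.
            Files.write(tmp, snapshot, StandardCharsets.UTF_8);
            Files.move(tmp, listFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            synchronized (this) {
                dirty = true; // retry on the next flush
            }
            throw e;
        }
    }

    // Read back the persisted list, e.g. after a restart.
    static Set<String> load(Path listFile) throws IOException {
        return new TreeSet<>(Files.readAllLines(listFile, StandardCharsets.UTF_8));
    }
}
```

The heartbeat path only touches the in-memory set; a separate thread would call
flush() periodically, so several joins and losses cost one file write.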
Both these solutions (should) guarantee that the trackers that were lost in the
old jobtracker will also be lost in the new jobtracker. Thoughts?
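The restart-time check that both options rely on can be sketched as a set
difference. This is only an illustration under assumed names
(LostTrackerReconciler and trackersToLose are hypothetical): the first set is
the trackers recovered from the history, the second is whatever was persisted
(the file names in _system-dir_ for option 1, the contents of the tracker-file
for option 2).

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not Hadoop code).
class LostTrackerReconciler {
    // Trackers seen in the history but absent from the persisted live set
    // are the ones the new jobtracker should treat as lost.
    static Set<String> trackersToLose(Set<String> fromHistory, Set<String> persistedLive) {
        Set<String> lost = new TreeSet<>(fromHistory);
        lost.removeAll(persistedLive);
        return lost;
    }
}
```

The resulting set would be processed before the jobtracker starts accepting
heartbeats, so that a lost tracker is asked to re-init when it joins back.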
> Inconsistency in handling lost trackers upon jobtracker restart
> ---------------------------------------------------------------
>
> Key: HADOOP-5319
> URL: https://issues.apache.org/jira/browse/HADOOP-5319
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Amar Kamat
>
> If a tasktracker is lost, the jobtracker kills all the tasks that were
> successful on that tracker and re-executes them somewhere else. The in-memory
> data structures are all cleared for the lost tracker. Now if the jobtracker
> restarts, the new jobtracker has no clue about the trackers that were lost,
> and hence if the lost trackers join back, they will be accepted and all the
> tasks on those trackers will join back. The issues are as follows:
> - If the running task on the lost tracker is killed, its cleanup attempt will
> be launched. Now the new jobtracker has no idea about this attempt. Also, the
> lost tracker can join back, and hence there are 2 attempts running with the
> same id, one which can move the tip to success and another which moves the
> tip to the killed state.
> - Ideally, the lost tracker should be asked to re-init, which won't happen now.