[ 
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600825#action_12600825
 ] 

Hemanth Yamijala commented on HADOOP-3245:
------------------------------------------

To clarify a bit: we considered two different approaches for providing 
persistence in the JobTracker. The first was to persist completed task 
information to a log file, much as the NameNode does with its edits log. The 
second is what Amar has described above, where TaskTrackers send 'Task 
Reports' to the restarted JT on demand. We feel the second approach is 
preferable, as it scales better. 
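
As a rough illustration of the second approach, here is a minimal sketch of 
how a restarted JT might rebuild its table of completed tasks from the reports 
that TTs resend. The class and method names are hypothetical and are not 
Hadoop APIs; the rule that the first report for a task wins and later ones are 
ignored is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not a Hadoop class: the state a restarted JT
// rebuilds from the task reports that TaskTrackers resend.
class RestartedTrackerState {
    // Completed task ID -> name of the tracker whose report was recorded.
    private final Map<String, String> completedBy = new LinkedHashMap<>();

    /**
     * Records a completed-task report from a tracker.
     * Assumption: the first report for a task wins; any later report for
     * the same task (for example from a re-execution) is ignored.
     *
     * @return true if the report was recorded, false if it was a duplicate
     */
    boolean acceptReport(String taskId, String trackerName) {
        if (completedBy.containsKey(taskId)) {
            return false;
        }
        completedBy.put(taskId, trackerName);
        return true;
    }

    /** True if some tracker has already reported this task as completed. */
    boolean isCompleted(String taskId) {
        return completedBy.containsKey(taskId);
    }

    /** Name of the tracker whose report was recorded, or null if none. */
    String completedBy(String taskId) {
        return completedBy.get(taskId);
    }
}
```

Because nothing is written to disk, the table is only as complete as the set 
of reports received so far.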

However, the second approach does have a couple of issues:

- Upon restart of the JT, before a TT has resent its task reports to the JT, 
other TTs could be given the same tasks to execute. This would cause 
re-executions, and the duplicates must be discarded. If this is a serious 
problem, we could make the JT wait for a while (described by Amar above). We 
feel it is OK to re-execute, as it appears that TTs will sync fast enough.
- Upon restart, the order of Task Completion events is lost and must be 
rebuilt from the task reports, which may arrive in a different order from the 
original task completion events. However, reduce tasks which have already 
fetched some events depend on this order. One way to handle this is to make 
the TT re-fetch all Task Completion events from the beginning upon a JT 
restart. It can then check which map outputs it has already shuffled and skip 
them. As the shuffle is more expensive than re-fetching the events, we feel 
this overhead is manageable.
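
The re-fetch idea in the second point could look roughly like the sketch 
below. It is only an illustration: the class, the event type, and the method 
names are hypothetical and do not correspond to existing Hadoop code. It 
assumes the reduce side remembers which map outputs it has already shuffled, 
resets its event cursor when it learns of a JT restart, and filters the 
re-fetched events against that set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not a Hadoop class: reduce-side bookkeeping for
// re-fetching Task Completion events after a JT restart.
class ShuffleResync {
    /** Minimal stand-in for a Task Completion event of a map task. */
    static final class CompletionEvent {
        final String mapTaskId;
        final String outputLocation;

        CompletionEvent(String mapTaskId, String outputLocation) {
            this.mapTaskId = mapTaskId;
            this.outputLocation = outputLocation;
        }
    }

    // Map tasks whose output has already been copied; survives a JT restart.
    private final Set<String> shuffledMaps = new HashSet<>();
    // Index of the next event to ask the JT for.
    private int nextEventIndex = 0;

    void markShuffled(String mapTaskId) {
        shuffledMaps.add(mapTaskId);
    }

    /** On a JT restart the old event order is meaningless, so start over. */
    void onJobTrackerRestart() {
        nextEventIndex = 0;
    }

    int nextEventIndex() {
        return nextEventIndex;
    }

    /**
     * Takes the events fetched starting at nextEventIndex, advances the
     * cursor, and returns only the events whose map output still has to be
     * shuffled.
     */
    List<CompletionEvent> selectUnfetched(List<CompletionEvent> fetched) {
        List<CompletionEvent> pending = new ArrayList<>();
        for (CompletionEvent event : fetched) {
            if (!shuffledMaps.contains(event.mapTaskId)) {
                pending.add(event);
            }
        }
        nextEventIndex += fetched.size();
        return pending;
    }
}
```

Since the filter keys on the map task ID rather than on the event's position, 
the order in which the restarted JT rebuilds its events does not matter here.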

Please do share your thoughts on whether these points make sense.


> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>             Fix For: 0.18.0
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be 
> applied for things like jobs being able to survive jobtracker restarts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
