[
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600825#action_12600825
]
Hemanth Yamijala commented on HADOOP-3245:
------------------------------------------
To clarify a bit: we considered two different approaches for providing
persistence in the JobTracker. The first was to persist completed task
information to a log file, much like the NameNode's edits log. The second is
what Amar has described above, where TaskTrackers send 'Task Reports' to the
restarted JT on demand. We feel the second approach is preferable, as it
scales better.
However, the second approach does have a couple of issues:
- After the JT restarts, it may hand a task to another TT before the TT that
already completed it has resent its task reports. The task would then be
re-executed, and the duplicate must be discarded. If this is a serious
problem, we could make the JT wait for a while (as Amar described above). We
feel it is OK to re-execute, as it appears that TTs will sync fast enough.
- Upon restart, the order of Task Completion events is lost and must be
rebuilt from the task reports. The task reports may arrive in a different
order from the original task completion events. However, reduce tasks that
have already fetched some events depend on this order. One way to handle this
is to make the TT re-fetch all Task Completion events from the beginning after
a JT restart. It can then check which map outputs it has already shuffled and
skip them. Since the shuffle is the more expensive operation compared with
re-fetching the events, we feel this overhead is manageable.
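
To make the two points above concrete, here is a rough sketch in Java. It is
only an illustration under stated assumptions, not the patch: the class
RestartRecoverySketch, the TaskReport shape, and both method names are
hypothetical and do not come from the Hadoop code base. It assumes the
restarted JT keeps the first report it receives for each task and discards
later duplicates, and that a reducer re-reads the rebuilt event list from the
start and skips map outputs it has already shuffled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in Hadoop. */
final class RestartRecoverySketch {

    /** Assumed shape of a completed-task report resent by a TaskTracker. */
    static final class TaskReport {
        final String taskId;      // task the report is about
        final String trackerName; // TaskTracker that completed it

        TaskReport(String taskId, String trackerName) {
            this.taskId = taskId;
            this.trackerName = trackerName;
        }
    }

    /**
     * JT side: rebuild the completion event list from reports in the order
     * they arrive after the restart. The first report for a task wins; a
     * later report for the same task (a re-execution) is discarded.
     */
    static List<TaskReport> rebuildCompletionEvents(List<TaskReport> reportsInArrivalOrder) {
        Map<String, TaskReport> firstPerTask = new LinkedHashMap<>();
        for (TaskReport report : reportsInArrivalOrder) {
            firstPerTask.putIfAbsent(report.taskId, report);
        }
        return new ArrayList<>(firstPerTask.values());
    }

    /**
     * Reduce side: walk the rebuilt events from index 0 and return only the
     * map outputs that have not been shuffled yet, so the old event order
     * no longer matters.
     */
    static List<TaskReport> outputsStillToFetch(List<TaskReport> rebuiltEvents,
                                                Set<String> alreadyShuffled) {
        List<TaskReport> pending = new ArrayList<>();
        for (TaskReport event : rebuiltEvents) {
            if (!alreadyShuffled.contains(event.taskId)) {
                pending.add(event);
            }
        }
        return pending;
    }
}
```

Keying the reducer's bookkeeping on the task ID rather than on an event index
is what lets it tolerate the new ordering; the cost is one pass over the
event list per restart.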
Please do share your thoughts on whether these points make sense.
> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
> Key: HADOOP-3245
> URL: https://issues.apache.org/jira/browse/HADOOP-3245
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Amar Kamat
> Fix For: 0.18.0
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be
> applied for things like jobs being able to survive jobtracker restarts.