[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000720#comment-15000720
]
Bikas Saha commented on TEZ-2581:
---------------------------------
I think we are close to getting this committed :)
Thanks for choosing Option 4. The main code in VertexImpl now only has a small
work around for NoOpManager. That may be easier to change if we later want to
support running the original manager with some user state.
NoOpManager - we can remove code duplication by having a common method to
reconfigureVertex. Also lets use the reconfigureVertex method that sets
everything to be safe. E.g. even in initializing state original manager could
have changed the edges (and not just the input specs).
TaskImpl.java
Minor: IMO the if condition will be more clear if we use task.recoveryData !=
null. This implies the semantics are the same as the code just before it which
is guarded by the same condition.
{code}+ if (!scheduleEvent.isFromRecovery()) {
+ task.scheduledTime = task.clock.getTime();
+ task.logJobHistoryTaskStartedEvent();
+ }
{code}
Please create a jira instead of leaving behind a TODO
{code}+ // TODO use tFinishedEvent.getTerminationCause after adding
TaskTerminationCause to TaskFinishedEvent
+ task.eventHandler.handle(new TaskEventTermination(task.taskId,{code}
Secondly, for handling commit recovery, would it be possible to do the check
and recovery of commit operation in the recovery flow of the ta_done inside the
attempt itself. That way the full responsibility of recovering the attempt
(including its commit) will stay in the attempt instead of spilling over into
the task? IF you think this would be better, we can do it in a follow up jira
or within this patch, your call.
TaskAttemptImpl.java
Why are we sending TaskAttemptEventAttemptFailed and then
TaskAttemptEventContainerTerminated? Shouldn't the first be enough to change
the state to failed?
IMO, sending containerterminated event is risky since in the recovery flow we
dont have containers populated and the transition for handling container
terminated event may try to use container related info (if not now then in the
future)
{code}+ case FAILED:
+ LOG.debug("TaskAttempt is failed in the last AM attempt,
attemptId=" + ta.attemptId);
+ ta.sendEvent(new TaskAttemptEventAttemptFailed(ta.attemptId,
+ TaskAttemptEventType.TA_FAILED,
taFinishedEvent.getDiagnostics(), taFinishedEvent.getTaskAttemptError()));
+ // send a faked ContainerTerminatedEvent to move the TaskAttempt
to FAILED state.
+ ta.sendEvent(new
TaskAttemptEventContainerTerminated(ta.attemptId,
+ taFinishedEvent.getDiagnostics(),
taFinishedEvent.getTaskAttemptError()));
+ break;
+{code}
I skimmed over the test changes. Looks good. I see that we have used the tez
examples for some e2e cases. It may be useful to follow the pattern of
TestFaultTolerance to create some specific controlled cases and use them for a
formal test matrix. E.g. Lets say that we create a 3 level dag. And lets say
vertices can be Initializing (I), HalfRunning (H), Done(D). Then we can have
cases like III, HII, HHH, DDD, HIH etc. This would create a methodical coverage
for various cases. If you agree then this can be a follow up jira.
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch,
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch,
> TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch,
> TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch,
> TezRecoveryRedesignProposal.pdf, TezRecoveryRedesignV1.1.pdf
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)