[ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000720#comment-15000720
 ] 

Bikas Saha commented on TEZ-2581:
---------------------------------


I think we are close to getting this committed :)

Thanks for choosing Option 4. The main code in VertexImpl now only has a small 
work around for NoOpManager. That may be easier to change if we later want to 
support running the original manager with some user state.
NoOpManager - we can remove code duplication by having a common method to 
reconfigureVertex. Also lets use the reconfigureVertex method that sets 
everything to be safe. E.g. even in initializing state original manager could 
have changed the edges (and not just the input specs).

TaskImpl.java
Minor: IMO the if condition will be more clear if we use task.recoveryData != 
null. This implies the semantics are the same as the code just before it which 
is guarded by the same condition.
{code}+      if (!scheduleEvent.isFromRecovery()) {
+        task.scheduledTime = task.clock.getTime();
+        task.logJobHistoryTaskStartedEvent();
+      }
{code}

Please create a jira instead of leaving behind a TODO
{code}+          // TODO use tFinishedEvent.getTerminationCause after adding 
TaskTerminationCause to TaskFinishedEvent
+          task.eventHandler.handle(new TaskEventTermination(task.taskId,{code}

Secondly, for handling commit recovery, would it be possible to do the check 
and recovery of commit operation in the recovery flow of the ta_done inside the 
attempt itself. That way the full responsibility of recovering the attempt 
(including its commit) will stay in the attempt instead of spilling over into 
the task? IF you think this would be better, we can do it in a follow up jira 
or within this patch, your call.

TaskAttemptImpl.java
Why are we sending TaskAttemptEventAttemptFailed and then 
TaskAttemptEventContainerTerminated? Shouldn't the first be enough to change 
the state to failed?
IMO, sending containerterminated event is risky since in the recovery flow we 
dont have containers populated and the transition for handling container 
terminated event may try to use container related info (if not now then in the 
future)
{code}+            case FAILED:
+              LOG.debug("TaskAttempt is failed in the last AM attempt, 
attemptId=" + ta.attemptId);
+              ta.sendEvent(new TaskAttemptEventAttemptFailed(ta.attemptId,
+                  TaskAttemptEventType.TA_FAILED, 
taFinishedEvent.getDiagnostics(), taFinishedEvent.getTaskAttemptError()));
+              // send a faked ContainerTerminatedEvent to move the TaskAttempt 
to FAILED state.
+              ta.sendEvent(new 
TaskAttemptEventContainerTerminated(ta.attemptId,
+                  taFinishedEvent.getDiagnostics(), 
taFinishedEvent.getTaskAttemptError()));
+              break;
+{code}

I skimmed over the test changes. Looks good. I see that we have used the tez 
examples for some e2e cases. It may be useful to follow the pattern of 
TestFaultTolerance to create some specific controlled cases and use them for a 
formal test matrix. E.g. Lets say that we create a 3 level dag. And lets say 
vertices can be Initializing (I), HalfRunning (H), Done(D). Then we can have 
cases like III, HII, HHH, DDD, HIH etc. This would create a methodical coverage 
for various cases. If you agree then this can be a follow up jira.


> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch, 
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch, 
> TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch, 
> TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch, 
> TezRecoveryRedesignProposal.pdf, TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to