[jira] [Comment Edited] (TEZ-2581) Umbrella for Tez Recovery Redesign

Bikas Saha (JIRA) Sun, 22 Nov 2015 15:19:59 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15021234#comment-15021234
 ]


Bikas Saha edited comment on TEZ-2581 at 11/22/15 11:18 PM:
------------------------------------------------------------

Could you please a //for recovery comment in the state machine transitions for 
the ones that were changed for recovery? E.g. NEW->FAILED in TAImpl

TaskAttemptImpl.java
In New->Killed should we use TerminatedBeforeRunningTransition like we do for 
New->Failed. Or vice versa in New->Failed should we use TerminateTransition? 
TerminateTransition seems better since it does not send unnecessary events to 
the scheduler.

Incorrect log
{code}
+          case SUCCEEDED:
+            LOG.debug("TaskAttemptFinishedEvent is seen with state of 
SUCCEEDED"
+                + ", send TaskAttemptEventKillRequest to itself"
+                + ", attemptId=" + ta.attemptId);
+            ta.sendEvent(new TaskAttemptEvent(ta.getID(), 
TaskAttemptEventType.TA_DONE));{code}

TaskImpl.java
Is this still needed now that we move handle TA recovery in the scheduled 
transition itself?
{code}-    .addTransition(TaskStateInternal.RUNNING, 
TaskStateInternal.SUCCEEDED,
+    .addTransition(TaskStateInternal.RUNNING,
+        EnumSet.of(TaskStateInternal.SUCCEEDED, TaskStateInternal.FAILED, 
TaskStateInternal.RUNNING),
         TaskEventType.T_ATTEMPT_SUCCEEDED,
{code}

Please put a comment here with TEZ-2958 for moving this TA to killed state.
{code}+          task.successfulAttempt = null;
+          task.addAndScheduleAttempt(successTaId);
+          return TaskStateInternal.RUNNING{code}

Can the TaskFinishedEvent have a failed state also? Previous AM had a task 
failure and then crashed before dag failed finish was written to recovery?
{code}+        // If TaskStartedEvent is not seen but TaskFinishedEvent is 
seen, that means 
+        // Task is killed before it is started. Just send T_TERMINATE to 
itself to move to KILLED
+        if (tStartedEvent == null
+            && tFinishedEvent != null) {
+          Preconditions.checkArgument(tFinishedEvent.getState() == 
TaskState.KILLED,
+              "TaskStartedEvent is not seen, but TaskFinishedEvent is seen and 
with invalid state="
+                  + tFinishedEvent.getState() + ", taskId=" + 
task.getTaskId());
+          // TODO (TEZ-2938)
+          // use tFinishedEvent.getTerminationCause after adding 
TaskTerminationCause to TaskFinishedEvent
+          task.eventHandler.handle(new TaskEventTermination(task.taskId,
+              TaskAttemptTerminationCause.UNKNOWN_ERROR, 
tFinishedEvent.getDiagnostics(), true));
+          return TaskStateInternal.NEW;
+        }{code}

Please put if (log.isdebugenabled) in the NoopVertexManager code.

bq. Yes, this is an issue which may confuse users but would not affect the job. 
Will create a follow up jira, and I think it will exist not only in the 
recovery case, but also in speculation.
For speculation, the winning TA goes to succeeded. The other TA is killed.



was (Author: bikassaha):
Could you please a //for recovery comment in the state machine transitions for 
the ones that were changed for recovery? E.g. NEW->FAILED in TAImpl

TaskAttemptImpl.java
In New->Killed should we use TerminatedBeforeRunningTransition like we do for 
New->Failed. Or vice versa in New->Failed should we use TerminateTransition?

Incorrect log
{code}
+          case SUCCEEDED:
+            LOG.debug("TaskAttemptFinishedEvent is seen with state of 
SUCCEEDED"
+                + ", send TaskAttemptEventKillRequest to itself"
+                + ", attemptId=" + ta.attemptId);
+            ta.sendEvent(new TaskAttemptEvent(ta.getID(), 
TaskAttemptEventType.TA_DONE));{code}

TaskImpl.java
Is this still needed now that we move handle TA recovery in the scheduled 
transition itself?
{code}-    .addTransition(TaskStateInternal.RUNNING, 
TaskStateInternal.SUCCEEDED,
+    .addTransition(TaskStateInternal.RUNNING,
+        EnumSet.of(TaskStateInternal.SUCCEEDED, TaskStateInternal.FAILED, 
TaskStateInternal.RUNNING),
         TaskEventType.T_ATTEMPT_SUCCEEDED,
{code}

Please put a comment here with TEZ-2958 for moving this TA to killed state.
{code}+          task.successfulAttempt = null;
+          task.addAndScheduleAttempt(successTaId);
+          return TaskStateInternal.RUNNING{code}

Can the TaskFinishedEvent have a failed state also? Previous AM had a task 
failure and then crashed before dag failed finish was written to recovery?
{code}+        // If TaskStartedEvent is not seen but TaskFinishedEvent is 
seen, that means 
+        // Task is killed before it is started. Just send T_TERMINATE to 
itself to move to KILLED
+        if (tStartedEvent == null
+            && tFinishedEvent != null) {
+          Preconditions.checkArgument(tFinishedEvent.getState() == 
TaskState.KILLED,
+              "TaskStartedEvent is not seen, but TaskFinishedEvent is seen and 
with invalid state="
+                  + tFinishedEvent.getState() + ", taskId=" + 
task.getTaskId());
+          // TODO (TEZ-2938)
+          // use tFinishedEvent.getTerminationCause after adding 
TaskTerminationCause to TaskFinishedEvent
+          task.eventHandler.handle(new TaskEventTermination(task.taskId,
+              TaskAttemptTerminationCause.UNKNOWN_ERROR, 
tFinishedEvent.getDiagnostics(), true));
+          return TaskStateInternal.NEW;
+        }{code}

Please put if (log.isdebugenabled) in the NoopVertexManager code.

bq. Yes, this is an issue which may confuse users but would not affect the job. 
Will create a follow up jira, and I think it will exist not only in the 
recovery case, but also in speculation.
For speculation, the winning TA goes to succeeded. The other TA is killed.


> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch, 
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-12.patch, TEZ-2581-WIP-13.patch, 
> TEZ-2581-WIP-14.patch, TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch, 
> TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch, 
> TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch, 
> TezRecoveryRedesignProposal.pdf, TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TEZ-2581) Umbrella for Tez Recovery Redesign

Reply via email to