[jira] [Commented] (TEZ-2581) Umbrella for Tez Recovery Redesign

Bikas Saha (JIRA) Tue, 27 Oct 2015 06:29:57 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976365#comment-14976365
 ]


Bikas Saha commented on TEZ-2581:
---------------------------------

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
Comments embedded in the code snippet itself.
{code}
  protected static class ScheduleTaskattemptTransition implements ....
      if (ta.recoveryData != null) {
        TaskAttemptStartedEvent taStartedEvent =
            ta.recoveryData.getTaskAttemptStartedEvent(ta.getID());
        // If TaskAttemptStartedEvent is seen, we don't need to ask for new 
container to launch it.
        // Just send TaskAttemptEventStartedRemotely to move it to RUNNING
        if (taStartedEvent != null) { >>>> debug logging
          LOG.info("TaskAttemptStartedEvent is seen, send 
TaskAttemptEventStartedRemotely to itself, attemptId=" + ta.attemptId);
          ta.sendEvent(new TaskAttemptEventStartedRemotely(ta.attemptId, 
taStartedEvent.getContainerId(), null));
          return TaskAttemptStateInternal.START_WAIT;
        } ......
        if (taFinishedEvent != null) {
          if (taFinishedEvent.getState() == TaskAttemptState.FAILED) { >>>> 
debug logging
            LOG.info("TaskAttemptFinishedEvent is seen with state of FAILED, 
send TaskAttemptEventAttemptFailed to itself"
                + ", attemptId=" + ta.attemptId);
            ta.sendEvent(new TaskAttemptEventAttemptFailed(ta.getID(), 
TaskAttemptEventType.TA_FAILED,
                taFinishedEvent.getDiagnostics(), 
taFinishedEvent.getTaskAttemptError()));
          } else { >>>>>> Precondition that state should be KILLED?
            LOG.info("TaskAttemptFinishedEvent is seen with state of KILLED, 
send TaskAttemptEventKillRequest to itself"
                + ", attemptId=" + ta.attemptId);
            ta.sendEvent(new TaskAttemptEventKillRequest(ta.getID(),
                taFinishedEvent.getDiagnostics(), 
taFinishedEvent.getTaskAttemptError()));
          }
          return TaskAttemptStateInternal.NEW;
        }
>>>>> can the control come if recovery data is non null. If yes or no, are 
>>>>> there some preconditions for that?
      }
      
      TaskAttemptEventSchedule scheduleEvent = (TaskAttemptEventSchedule) event;
      ta.scheduledTime = ta.clock.getTime();
{code}

The if condition for container being null is not semantically accurate since 
container is expected to be non-null. IMO, the code flow (everywhere) should be 
like it is in ScheduleTaskattemptTransition, where first recovery data is 
checked for alternatives. If there is no alternative then regular code flow 
happens without any new checks.
{code}
      // Container may be null in recovery since we do not recover container 
for now.
      if (ta.appContext.getAllContainers().get(event.getContainerId()) != null) 
{
        AMContainer amContainer = 
ta.appContext.getAllContainers().get(event.getContainerId()); 
{code}
The pattern below, where possible, may be more maintainable
{code}// do common work
if (recover from recoveryData) {
  // do recovery exclusive work e.g. send events to itself
} else {
  // do normal exclusive work. e.g. logHistoryStarted() here and not have 
recovery logic inside logHistoryStarted() itself
}
// do common work{code} e.g. send events to Task

Comments in the code block directly
{code}
        if (taFinishedEvent != null) {
.....
        } <<<<< should the next line be in the else block of the above if? 
Seems to be so from the comments
        if (ta.recoveryData.getTaskAttemptStartedEvent(ta.getID()) != null) {
          // If TaskAttemptStartedEvent is seen but no TaskAttemptFinishedEvent 
is seen, <<<< should be in else stmt?
          // that means it is still in RUNNING in the last AM attempt,
          // Send TaskAttemptEventContainerTerminated to move it to KILLED.
          // In the future when we can support recover container, it is not 
necessary to send this event.
          LOG.info("TaskAttempt is still running in the last AM attempt, 
attemptId=" + ta.attemptId); <<< debug?
<<<<<< send a kill event instead of faking container terminated else we may see 
future bugs in the code for handling container termination
          ta.sendEvent(new TaskAttemptEventContainerTerminated(ta.attemptId, 
                  "Task Attempt killed in recovery due to can't recover the 
running task attempt",
                  TaskAttemptTerminationCause.TERMINATED_AT_SHUTDOWN)); <<<<< 
we should create a new termination cause
        }
{code}

TezEventUpdaterTransition - we could accumulate events and then write them upon 
finish once - any downsides to that?

For debugging/tracing it may be useful to add a isRecovered flag in each of the 
events we send out because of alternative recovery code path. This was in debug 
logging etc. we could see e.g. if the ta_done was sent by the listener or by 
the TA itself.

Can we check that the post recovery value of various times like 
startTime/launchTime/finishTime etc. remain at their pre recovery values?

Can we simply make the recoveryData availabe from the appContext instead of 
changing the constructor of TA/Task/Vertex/DAG etc. ? Might remove a lot of 
diffs.


> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TEZ-2581) Umbrella for Tez Recovery Redesign

Reply via email to