[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15021234#comment-15021234
]
Bikas Saha edited comment on TEZ-2581 at 11/22/15 11:18 PM:
------------------------------------------------------------
Could you please a //for recovery comment in the state machine transitions for
the ones that were changed for recovery? E.g. NEW->FAILED in TAImpl
TaskAttemptImpl.java
In New->Killed should we use TerminatedBeforeRunningTransition like we do for
New->Failed. Or vice versa in New->Failed should we use TerminateTransition?
TerminateTransition seems better since it does not send unnecessary events to
the scheduler.
Incorrect log
{code}
+ case SUCCEEDED:
+ LOG.debug("TaskAttemptFinishedEvent is seen with state of
SUCCEEDED"
+ + ", send TaskAttemptEventKillRequest to itself"
+ + ", attemptId=" + ta.attemptId);
+ ta.sendEvent(new TaskAttemptEvent(ta.getID(),
TaskAttemptEventType.TA_DONE));{code}
TaskImpl.java
Is this still needed now that we move handle TA recovery in the scheduled
transition itself?
{code}- .addTransition(TaskStateInternal.RUNNING,
TaskStateInternal.SUCCEEDED,
+ .addTransition(TaskStateInternal.RUNNING,
+ EnumSet.of(TaskStateInternal.SUCCEEDED, TaskStateInternal.FAILED,
TaskStateInternal.RUNNING),
TaskEventType.T_ATTEMPT_SUCCEEDED,
{code}
Please put a comment here with TEZ-2958 for moving this TA to killed state.
{code}+ task.successfulAttempt = null;
+ task.addAndScheduleAttempt(successTaId);
+ return TaskStateInternal.RUNNING{code}
Can the TaskFinishedEvent have a failed state also? Previous AM had a task
failure and then crashed before dag failed finish was written to recovery?
{code}+ // If TaskStartedEvent is not seen but TaskFinishedEvent is
seen, that means
+ // Task is killed before it is started. Just send T_TERMINATE to
itself to move to KILLED
+ if (tStartedEvent == null
+ && tFinishedEvent != null) {
+ Preconditions.checkArgument(tFinishedEvent.getState() ==
TaskState.KILLED,
+ "TaskStartedEvent is not seen, but TaskFinishedEvent is seen and
with invalid state="
+ + tFinishedEvent.getState() + ", taskId=" +
task.getTaskId());
+ // TODO (TEZ-2938)
+ // use tFinishedEvent.getTerminationCause after adding
TaskTerminationCause to TaskFinishedEvent
+ task.eventHandler.handle(new TaskEventTermination(task.taskId,
+ TaskAttemptTerminationCause.UNKNOWN_ERROR,
tFinishedEvent.getDiagnostics(), true));
+ return TaskStateInternal.NEW;
+ }{code}
Please put if (log.isdebugenabled) in the NoopVertexManager code.
bq. Yes, this is an issue which may confuse users but would not affect the job.
Will create a follow up jira, and I think it will exist not only in the
recovery case, but also in speculation.
For speculation, the winning TA goes to succeeded. The other TA is killed.
was (Author: bikassaha):
Could you please a //for recovery comment in the state machine transitions for
the ones that were changed for recovery? E.g. NEW->FAILED in TAImpl
TaskAttemptImpl.java
In New->Killed should we use TerminatedBeforeRunningTransition like we do for
New->Failed. Or vice versa in New->Failed should we use TerminateTransition?
Incorrect log
{code}
+ case SUCCEEDED:
+ LOG.debug("TaskAttemptFinishedEvent is seen with state of
SUCCEEDED"
+ + ", send TaskAttemptEventKillRequest to itself"
+ + ", attemptId=" + ta.attemptId);
+ ta.sendEvent(new TaskAttemptEvent(ta.getID(),
TaskAttemptEventType.TA_DONE));{code}
TaskImpl.java
Is this still needed now that we move handle TA recovery in the scheduled
transition itself?
{code}- .addTransition(TaskStateInternal.RUNNING,
TaskStateInternal.SUCCEEDED,
+ .addTransition(TaskStateInternal.RUNNING,
+ EnumSet.of(TaskStateInternal.SUCCEEDED, TaskStateInternal.FAILED,
TaskStateInternal.RUNNING),
TaskEventType.T_ATTEMPT_SUCCEEDED,
{code}
Please put a comment here with TEZ-2958 for moving this TA to killed state.
{code}+ task.successfulAttempt = null;
+ task.addAndScheduleAttempt(successTaId);
+ return TaskStateInternal.RUNNING{code}
Can the TaskFinishedEvent have a failed state also? Previous AM had a task
failure and then crashed before dag failed finish was written to recovery?
{code}+ // If TaskStartedEvent is not seen but TaskFinishedEvent is
seen, that means
+ // Task is killed before it is started. Just send T_TERMINATE to
itself to move to KILLED
+ if (tStartedEvent == null
+ && tFinishedEvent != null) {
+ Preconditions.checkArgument(tFinishedEvent.getState() ==
TaskState.KILLED,
+ "TaskStartedEvent is not seen, but TaskFinishedEvent is seen and
with invalid state="
+ + tFinishedEvent.getState() + ", taskId=" +
task.getTaskId());
+ // TODO (TEZ-2938)
+ // use tFinishedEvent.getTerminationCause after adding
TaskTerminationCause to TaskFinishedEvent
+ task.eventHandler.handle(new TaskEventTermination(task.taskId,
+ TaskAttemptTerminationCause.UNKNOWN_ERROR,
tFinishedEvent.getDiagnostics(), true));
+ return TaskStateInternal.NEW;
+ }{code}
Please put if (log.isdebugenabled) in the NoopVertexManager code.
bq. Yes, this is an issue which may confuse users but would not affect the job.
Will create a follow up jira, and I think it will exist not only in the
recovery case, but also in speculation.
For speculation, the winning TA goes to succeeded. The other TA is killed.
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch,
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-12.patch, TEZ-2581-WIP-13.patch,
> TEZ-2581-WIP-14.patch, TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch,
> TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch,
> TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch,
> TezRecoveryRedesignProposal.pdf, TezRecoveryRedesignV1.1.pdf
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)