[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001874#comment-15001874
]
Jeff Zhang edited comment on TEZ-2581 at 11/12/15 9:23 AM:
-----------------------------------------------------------
bq. Please create a jira instead of leaving behind a TODO
Created TEZ-2938
bq. Secondly, for handling commit recovery, would it be possible to do the
check and recovery of commit operation in the recovery flow of the ta_done
inside the attempt itself. That way the full responsibility of recovering the
attempt (including its commit) will stay in the attempt instead of spilling
over into the task? IF you think this would be better, we can do it in a follow
up jira or within this patch, your call.
I think it would be better to leave it in Task, because the commit of a task
attempt should be controlled by the task rather than by the attempt itself
(just as TaskAttemptListener also controls the commit of a task attempt). By
the semantics of the recovery API (OutputCommitter#recoverTask), it recovers
the task rather than the task attempt, so it would be odd to call
OutputCommitter#recoverTask in TaskAttemptImpl.
bq. Why are we sending TaskAttemptEventAttemptFailed and then
TaskAttemptEventContainerTerminated? Shouldn't the first be enough to change
the state to failed?
This is because all the termination-related transitions extend
TerminateTransition, which is a SingleArcTransition. So after the
TaskAttemptEventAttemptFailed event, the task attempt must go to
FAIL_IN_PROGRESS. That's why I send TaskAttemptEventContainerTerminated after
it. Another solution I can think of is to create a new event, e.g.
FAIL_IN_RECOVERY, and add a new transition for that event.
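To illustrate the constraint, here is a minimal toy sketch of a state machine whose arcs each have one fixed target state. It assumes a single-arc transition can only land in the state it was registered with, so reaching a terminal state from the intermediate one takes a second event. The class, enum, and hook names are hypothetical and are not the Tez or YARN classes; only the event and state names echo the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these names are illustrative, not the Tez/YARN state machine API.
class SingleArcSketch {
    enum State { RUNNING, FAIL_IN_PROGRESS, FAILED }
    enum Event { ATTEMPT_FAILED, CONTAINER_TERMINATED }

    /** Hook run when an arc fires; it cannot choose the target state. */
    interface SingleArc { void run(List<String> log); }

    /** One arc: (from, event) always leads to exactly one target state. */
    static final class Arc {
        final State from; final Event on; final State to; final SingleArc hook;
        Arc(State from, Event on, State to, SingleArc hook) {
            this.from = from; this.on = on; this.to = to; this.hook = hook;
        }
    }

    private final List<Arc> arcs = new ArrayList<>();
    private State current = State.RUNNING;
    final List<String> log = new ArrayList<>();

    SingleArcSketch() {
        // The failure event can only reach the fixed intermediate state.
        arcs.add(new Arc(State.RUNNING, Event.ATTEMPT_FAILED,
                State.FAIL_IN_PROGRESS, l -> l.add("terminate")));
        // A second event is needed to leave the intermediate state.
        arcs.add(new Arc(State.FAIL_IN_PROGRESS, Event.CONTAINER_TERMINATED,
                State.FAILED, l -> l.add("container terminated")));
    }

    /** Fires the matching arc and returns the new state. */
    State handle(Event e) {
        for (Arc a : arcs) {
            if (a.from == current && a.on == e) {
                a.hook.run(log);
                current = a.to;
                return current;
            }
        }
        throw new IllegalStateException("No arc for " + e + " in " + current);
    }

    public static void main(String[] args) {
        SingleArcSketch attempt = new SingleArcSketch();
        System.out.println(attempt.handle(Event.ATTEMPT_FAILED));
        System.out.println(attempt.handle(Event.CONTAINER_TERMINATED));
    }
}
```

A dedicated event such as FAIL_IN_RECOVERY would correspond to one extra arc going straight from the starting state to the failed state, at the cost of another event type.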
bq. I skimmed over the test changes. Looks good. I see that we have used the
tez examples for some e2e cases. It may be useful to follow the pattern of
TestFaultTolerance to create some specific controlled cases and use them for a
formal test matrix. E.g. Lets say that we create a 3 level dag. And lets say
vertices can be Initializing (I), HalfRunning (H), Done(D). Then we can have
cases like III, HII, HHH, DDD, HIH etc. This would create a methodical coverage
for various cases. If you agree then this can be a follow up jira.
Created TEZ-2939 for that.
* Currently I use OrderedWordCount & HashJoin as the system tests. I think
these are two typical DAG topologies. For each example there is one test
matrix, which will cover the cases you mentioned (III, HII, HHH, DDD, HIH).
* Another reason I use the examples is that I also need to verify the job
result. I have previously seen a job recover successfully but produce an
incorrect result (due to multiple copies of DM events).
* There's one drawback to the system tests: they take a lot of time, because
for each case the job runs under a minicluster in distributed mode, and in
total there are more than 30 cases.
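As a rough sketch of the test-matrix idea, the state combinations for a 3-level DAG can be generated mechanically. This assumes each level is in exactly one of the three states named above (I, H, D) and simply enumerates the full product. Some combinations may not be reachable in a real run, and this is not necessarily the set of cases the patch tests. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper only; not part of the Tez test code.
class RecoveryMatrixSketch {
    // Vertex states from the review comment: Initializing, HalfRunning, Done.
    static final char[] STATES = {'I', 'H', 'D'};

    /** Returns every state combination for a DAG with the given number of levels. */
    static List<String> cases(int levels) {
        List<String> out = new ArrayList<>();
        out.add("");
        for (int level = 0; level < levels; level++) {
            List<String> next = new ArrayList<>();
            for (String prefix : out) {
                for (char s : STATES) {
                    next.add(prefix + s); // extend each partial case by one level
                }
            }
            out = next;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> all = cases(3);
        System.out.println(all.size() + " cases: " + all);
    }
}
```

Each generated string could then be used as the name of one parameterized test case, with unreachable combinations filtered out by hand.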
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch,
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch,
> TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch,
> TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch,
> TezRecoveryRedesignProposal.pdf, TezRecoveryRedesignV1.1.pdf
>
>