[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15015038#comment-15015038
]
Bikas Saha commented on TEZ-2581:
---------------------------------
bq. The TA_KILLED event would only be sent in the case of normal DAG execution
when the task attempt can be recovered. If the task attempt cannot be
recovered, it would go to RUNNING directly.
I agree that the task will move into the running state. But the task attempt
(the one whose commit could not be recovered) will remain in the succeeded
state, right? I was thinking that it should move into the killed state, or else
it will show up incorrectly as a successful attempt in the UI etc. Is that
valid? If yes, then we can do this as a follow-up.
About the multiple-arc transition, I was thinking that the default return state
would be the existing kill_in_progress state, and that we would override the
default return state only in the new case. In any case, I see what you are
saying. Then the potential alternatives are:
1) Send a new recovery-specific event to transition from fail_in_progress to
fail instead of a fake container event.
2) From start_wait, can we go directly into killed/failed based on the recovery
data? What advantage do we get by going into the running state and then
transitioning? Would that trigger any container-related code which might cause
the AM to crash because containers are not recovered? Also, for a killed/failed
task, we don't know if it had gone to the running state in the previous AM.
There are valid transitions from start_wait directly to killed/failed. So that
attempt could have gone into failed/killed from start_wait in the previous AM,
while in the retry AM we will try to send it through
start_wait->running->kill/fail-in-progress->killed/failed. Thoughts?
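To make alternative 2 concrete, here is a minimal sketch of a multiple-arc style
decision out of start_wait. It is only an illustration: AttemptState,
RecoveryArc, and fromStartWait are hypothetical names, not the actual
TaskAttemptImpl code, and the recovery data is reduced to a single nullable
"recorded terminal state" value.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified state names modeled on the ones used in this thread.
enum AttemptState {
    START_WAIT, RUNNING, KILL_IN_PROGRESS, FAIL_IN_PROGRESS, SUCCEEDED, KILLED, FAILED
}

final class RecoveryArc {
    // States that this (assumed) multiple-arc transition may return.
    static final Set<AttemptState> TARGETS =
        EnumSet.of(AttemptState.RUNNING, AttemptState.KILLED, AttemptState.FAILED);

    // recorded is the terminal state found in the recovery data for this
    // attempt, or null when nothing terminal was recorded (assumption).
    static AttemptState fromStartWait(AttemptState recorded) {
        if (recorded == AttemptState.KILLED || recorded == AttemptState.FAILED) {
            // Override the default: go straight to the recorded terminal state
            // without passing through RUNNING and the *_IN_PROGRESS states.
            return recorded;
        }
        // Default return state: normal execution continues.
        return AttemptState.RUNNING;
    }

    public static void main(String[] args) {
        AttemptState[] inputs = {null, AttemptState.KILLED, AttemptState.FAILED};
        for (AttemptState recorded : inputs) {
            AttemptState next = fromStartWait(recorded);
            if (!TARGETS.contains(next)) {
                throw new IllegalStateException("unexpected target " + next);
            }
            System.out.println("recorded=" + recorded + " -> " + next);
        }
    }
}
```

In this sketch a recovered killed/failed attempt never enters RUNNING, so no
container-related handling would be reached for it. Whether the real state
machine can be arranged this way is exactly the open question above.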
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch,
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-12.patch, TEZ-2581-WIP-13.patch,
> TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch,
> TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch,
> TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf,
> TezRecoveryRedesignV1.1.pdf
>
>