[
https://issues.apache.org/jira/browse/TEZ-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214433#comment-14214433
]
Jeff Zhang edited comment on TEZ-1642 at 11/17/14 9:02 AM:
-----------------------------------------------------------
Attached a new patch. The main change in the patch is adding task/taskattempt
state notification to VertexManager.
* Currently, the task/taskattempt state notification is only added to
VertexManagerPluginContextImpl and is not exposed in the user-facing API, which
should have the least impact on the existing code. If it becomes necessary in
the future, we can move it to the user-facing API. The notification happens
immediately after the state change, through the callback of the StateMachine
transition rather than by sending an event asynchronously, which gives us more
accurate control over the running status.
* Currently, we only listen for the task/taskattempt succeeded notification. In
the future this can be extended to other state change notifications.
* Changes to TestAMRecovery
** Remove the verification of TOTAL_LAUNCHED_TASKS, because the VM of v1 can
only guarantee that v1 is partially or fully completed when the AM is killed; it
cannot guarantee the status of v2 (using sleep is not a perfect solution).
** Only verify that one task is completed in the XXXPartiallyCompeted test
cases rather than assume it is task_0, because it is not guaranteed that task_0
is scheduled before task_1.
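To illustrate the first point, here is a minimal sketch of a notification that
runs inside the state transition itself. All names in it (ToyTask,
TaskStateListener, succeededCountsSeenByListener) are hypothetical and are not
Tez classes or the API added by the patch; it only shows the general idea,
assuming the listener is invoked in the same call stack as the state change.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these are not Tez classes or Tez APIs.
class TransitionCallbackSketch {
    enum TaskState { RUNNING, SUCCEEDED }

    /** Called synchronously, right after a task changes state. */
    interface TaskStateListener {
        void onTaskStateChanged(int taskId, TaskState newState);
    }

    /** Toy task whose transition invokes its listeners before returning. */
    static class ToyTask {
        final int id;
        TaskState state = TaskState.RUNNING;
        private final List<TaskStateListener> listeners = new ArrayList<>();

        ToyTask(int id) { this.id = id; }

        void addListener(TaskStateListener listener) { listeners.add(listener); }

        void succeed() {
            state = TaskState.SUCCEEDED;                 // the state change
            for (TaskStateListener l : listeners) {      // same call stack, no queue
                l.onTaskStateChanged(id, state);
            }
        }
    }

    /** Records how many tasks had succeeded at the moment of each notification. */
    static List<Integer> succeededCountsSeenByListener() {
        ToyTask t0 = new ToyTask(0);
        ToyTask t1 = new ToyTask(1);
        List<Integer> seen = new ArrayList<>();
        TaskStateListener listener = (taskId, newState) -> {
            int succeeded = 0;
            if (t0.state == TaskState.SUCCEEDED) succeeded++;
            if (t1.state == TaskState.SUCCEEDED) succeeded++;
            seen.add(succeeded);
        };
        t0.addListener(listener);
        t1.addListener(listener);
        t0.succeed();
        t1.succeed();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(succeededCountsSeenByListener()); // prints [1, 2]
    }
}
```

Because the listener runs before succeed() returns, the first notification sees
exactly one succeeded task, which is the point at which a test could act on a
partially completed vertex.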
> TestAMRecovery sometimes fail
> -----------------------------
>
> Key: TEZ-1642
> URL: https://issues.apache.org/jira/browse/TEZ-1642
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-1642-2.patch, TEZ-1642-3.patch, TEZ-1642.patch
>
>
> TestAMRecovery sometimes fails on testVertexPartiallyFinished_XXX.
> The scenario is that we want to kill the AM when a vertex is partially
> finished (with 2 tasks, task_0 is finished and task_1 is running). On
> recovery, task_0 should not rerun and task_1 should rerun. (We use the
> recovery log (TaskAttemptFinishedEvent) to judge whether a task is rerun.)
> Currently, we use VertexManager.onSourceTaskCompleted to control when to kill
> the AM, but this is not perfect. VertexManager.onSourceTaskCompleted is not
> invoked at the moment the task attempt finishes (the TaskAttempt sends an
> event to the Task to say that the TaskAttempt is finished, and then the Task
> sends an event to the Vertex to trigger VM.onSourceTaskCompleted).
> The following sequence is possible: task_0 finished -> task_1 finished ->
> VM.onSourceTaskCompleted -> VM.onSourceTaskCompleted
> In this case, we treat the vertex as partially completed at the first
> VM.onSourceTaskCompleted, but the vertex is actually fully completed.
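The interleaving described above can be reproduced with a small deterministic
model. This is only a sketch: the single queue stands in for an asynchronous
event dispatcher, and the class and method names (DelayedCallbackRace,
taskAttemptFinished, taskNotified, drain, simulate) are hypothetical, not Tez
code. It assumes each hop (TaskAttempt to Task, Task to Vertex) costs one
queued event.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of the event chain in the description; not Tez code.
class DelayedCallbackRace {
    // Stand-in for an asynchronous dispatcher: events run later, in FIFO order.
    private final Queue<Runnable> dispatcher = new ArrayDeque<>();
    // Number of finished attempts that each callback observed when it ran.
    private final List<Integer> finishedSeenAtCallback = new ArrayList<>();
    private int finishedAttempts = 0;

    /** The attempt finishes now; the Task is only told via a queued event. */
    void taskAttemptFinished(int taskId) {
        finishedAttempts++;
        dispatcher.add(() -> taskNotified(taskId));
    }

    /** The Task forwards the news to the Vertex with a second queued event. */
    private void taskNotified(int taskId) {
        dispatcher.add(() -> onSourceTaskCompleted(taskId));
    }

    /** Model of VM.onSourceTaskCompleted: records the real state at call time. */
    private void onSourceTaskCompleted(int taskId) {
        finishedSeenAtCallback.add(finishedAttempts);
    }

    /** Delivers queued events until none remain. */
    void drain() {
        Runnable event;
        while ((event = dispatcher.poll()) != null) {
            event.run();
        }
    }

    /** Both attempts finish before any queued event is delivered. */
    static List<Integer> simulate() {
        DelayedCallbackRace sim = new DelayedCallbackRace();
        sim.taskAttemptFinished(0);
        sim.taskAttemptFinished(1);
        sim.drain();
        return sim.finishedSeenAtCallback;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints [2, 2]
    }
}
```

In this model the first callback already observes two finished attempts, so
code that kills the AM on the first callback would be acting on a fully
completed vertex rather than a partially completed one.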