[ 
https://issues.apache.org/jira/browse/TEZ-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214433#comment-14214433
 ] 

Jeff Zhang edited comment on TEZ-1642 at 11/25/14 12:32 AM:
------------------------------------------------------------

Attached a new patch. The main change is adding task/taskattempt state 
notification in VertexManager. [~hitesh] [~sseth] [~bikassaha] Please help 
review the new patch.

* Currently, the task/taskattempt state notification is only added to 
VertexManagerPluginContextImpl and is not exposed in the user-facing API; this 
should have the least impact on the existing code. If it becomes necessary in 
the future, we can move it to the user-facing API. The notification happens 
immediately after the state change, via the callback of the StateMachine 
transition rather than by sending an event asynchronously; this gives us more 
accurate control over the running status.
* Currently, it only listens for the task/taskattempt succeeded notification. 
In the future it can be extended to other state change notifications.
* Changes to TestAMRecovery
** Removed the verification of TOTAL_LAUNCHED_TASKS, because the VM of v1 can 
only guarantee that v1 is partially or fully completed when the AM is killed; 
it cannot guarantee the status of v2 (using sleep is not a perfect solution).
** Only verify that one task is completed in the XXXPartiallyCompeted test 
cases rather than assuming it is task_0, because it is not guaranteed that 
task_0 is scheduled before task_1.
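
To illustrate why a callback made directly from the state transition gives 
more accurate control, here is a minimal, self-contained sketch. It is not the 
code in the patch: the names Task, succeed and succeededCountSeenByCallbacks 
are hypothetical, and the Tez StateMachine is replaced by a plain method call. 
Because the listener runs in the same call that changes the state, each 
callback sees exactly the tasks that have succeeded so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; these types are NOT the Tez classes.
class SyncTransitionCallback {
    enum State { RUNNING, SUCCEEDED }

    static final class Task {
        final String id;
        State state = State.RUNNING;
        Task(String id) { this.id = id; }
    }

    // The state change and the notification happen in the same call,
    // so no other transition can slip in between them.
    static void succeed(Task task, Consumer<Task> onSucceeded) {
        task.state = State.SUCCEEDED;
        onSucceeded.accept(task);
    }

    // Returns, for each callback, how many tasks had succeeded when it ran.
    static List<Integer> succeededCountSeenByCallbacks() {
        List<Task> tasks = Arrays.asList(new Task("task_0"), new Task("task_1"));
        List<Integer> seen = new ArrayList<>();
        Consumer<Task> listener = finished -> {
            int succeeded = 0;
            for (Task t : tasks) {
                if (t.state == State.SUCCEEDED) {
                    succeeded++;
                }
            }
            seen.add(succeeded);
        };
        for (Task t : tasks) {
            succeed(t, listener);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(succeededCountSeenByCallbacks()); // prints [1, 2]
    }
}
```

In this sketch the first callback observes one succeeded task, which is the 
"partially completed" moment a test would want to act on.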



> TestAMRecovery sometimes fail
> -----------------------------
>
>                 Key: TEZ-1642
>                 URL: https://issues.apache.org/jira/browse/TEZ-1642
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1642-2.patch, TEZ-1642-3.patch, TEZ-1642-4.patch, 
> TEZ-1642.patch
>
>
> TestAMRecovery sometimes fails on testVertexPartiallyFinished_XXX.  
> The scenario is that we'd like to kill the AM when the vertex is partially 
> finished (with 2 tasks, task_0 is finished and task_1 is running). On 
> recovery, task_0 should not rerun and task_1 should rerun. (We use the 
> recovery log (TaskAttemptFinishedEvent) to judge whether a task is rerun.)
> Currently, VertexManager.onSourceTaskCompleted is used to control when to 
> kill the AM, but this is not perfect. VertexManager.onSourceTaskCompleted is 
> not invoked at the moment the task attempt finishes (the TaskAttempt sends an 
> event to the Task to tell it the TaskAttempt is finished, and then the Task 
> sends an event to the Vertex to trigger VM.onSourceTaskCompleted).
> The following case is possible: task_0 finished -> task_1 finished -> 
> VM.onSourceTaskCompleted -> VM.onSourceTaskCompleted
> In this case, we treat the vertex as partially completed in the first 
> VM.onSourceTaskCompleted, but the vertex is actually fully completed.
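
The ordering described in the issue can be reproduced with a small, 
deterministic simulation. This is only a sketch under stated assumptions: the 
event queue, the class name AsyncNotificationRace and the method 
finishedCountSeenByCallbacks are hypothetical and stand in for the Tez 
dispatcher and VM.onSourceTaskCompleted; no Tez API is used.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the race; these are NOT the Tez classes.
class AsyncNotificationRace {

    // Returns, for each delivered callback, how many tasks had actually
    // finished at the time the callback ran.
    static List<Integer> finishedCountSeenByCallbacks() {
        Queue<String> eventQueue = new ArrayDeque<>();
        List<String> finishedTasks = new ArrayList<>();
        List<Integer> seen = new ArrayList<>();

        // Both tasks finish before the dispatcher delivers any event:
        // task_0 finished -> task_1 finished
        finishedTasks.add("task_0");
        eventQueue.add("task_0");
        finishedTasks.add("task_1");
        eventQueue.add("task_1");

        // The dispatcher now drains the queue; each removal stands in for
        // one delayed onSourceTaskCompleted-style callback.
        while (!eventQueue.isEmpty()) {
            eventQueue.remove();
            seen.add(finishedTasks.size());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(finishedCountSeenByCallbacks()); // prints [2, 2]
    }
}
```

A test that kills the AM on the first callback would assume one finished task, 
while in this ordering both tasks have already finished.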



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
