[jira] [Comment Edited] (TEZ-992) Recovery data should not be written on AsyncDispatcher thread

Jeff Zhang (JIRA) Thu, 20 Nov 2014 02:57:08 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219187#comment-14219187
 ]


Jeff Zhang edited comment on TEZ-992 at 11/20/14 10:56 AM:
-----------------------------------------------------------

This jira seems closely related to TEZ-714, so I will try to resolve them 
together.  I created a FinishSavingService for logging critical recovery events 
and for commit/abort.

Attached are the 2 state machine diagrams (DAG/Vertex).  [~bikassaha], [~hitesh], 
please help review the state machines and answer my 2 questions below.
* The main change is one additional state: FINISH_SAVING.  ( Before going to 
SUCCEEDED/FAILED/KILLED, the DAG/Vertex goes to FINISH_SAVING first. )
** In the Vertex's FINISH_SAVING, it logs recovery data (VertexFinishedEvent, 
VertexCommitStartedEvent) and commits/aborts the data if necessary. 
** In the DAG's FINISH_SAVING, it logs recovery data (DAGCommitStartedEvent, 
DAGFinishedEvent) and commits/aborts the data if necessary.
** VertexGroupCommitStartedEvent / VertexGroupCommitFinishedEvent will be 
handled in FinishSavingService while the DAG stays in the RUNNING state.
** DAGSubmittedEvent is a special event: it is logged in the RPC call 
submitDAG, so it is not handled in the FinishSavingService.
** During recovery, recovery events are still logged and commits still run on 
the main AsyncDispatcher.  I am not confident enough to move this off the main 
AsyncDispatcher now, so it may be left for another jira. 
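
To make the flow above concrete, here is a minimal sketch of the idea, not the actual patch. Everything in it is hypothetical: the class FinishSavingSketch, the two single-thread executors (stand-ins for the main AsyncDispatcher thread and the FinishSavingService), and the log strings are assumptions for illustration only and are not Tez classes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names are real Tez classes.
class FinishSavingSketch {
    enum State { RUNNING, FINISH_SAVING, SUCCEEDED, FAILED, KILLED }

    // Stand-in for the main AsyncDispatcher thread.
    private final ExecutorService dispatcher =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "dispatcher"));
    // Stand-in for the FinishSavingService thread.
    private final ExecutorService saver =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "saver"));

    private final List<String> recoveryLog = new CopyOnWriteArrayList<>();
    private final CountDownLatch finished = new CountDownLatch(1);
    private volatile State state = State.RUNNING;

    // Request a final state; the entity parks in FINISH_SAVING until saving is done.
    void finish(State finalState) {
        dispatcher.execute(() -> {
            if (state != State.RUNNING) {
                return; // already finishing or finished
            }
            state = State.FINISH_SAVING;
            // The (possibly slow) recovery write happens off the dispatcher thread.
            saver.execute(() -> {
                recoveryLog.add(Thread.currentThread().getName()
                    + ": FinishedEvent(" + finalState + ")");
                // Notify the dispatcher so it can complete the transition.
                dispatcher.execute(() -> {
                    state = finalState;
                    finished.countDown();
                });
            });
        });
    }

    // Block until the final state is reached (or a timeout), then stop both threads.
    State awaitFinalState() {
        try {
            if (!finished.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("saving did not complete in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            dispatcher.shutdown();
            saver.shutdown();
        }
        return state;
    }

    List<String> recoveryLog() {
        return recoveryLog;
    }

    public static void main(String[] args) {
        FinishSavingSketch vertex = new FinishSavingSketch();
        vertex.finish(State.SUCCEEDED);
        System.out.println(vertex.awaitFinalState());
        System.out.println(vertex.recoveryLog());
    }
}
```

The point of the sketch is only that the dispatcher thread never performs the write itself: it changes the state, hands the work off, and finishes the transition when it is notified.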
 
I have implemented a prototype of this feature ( the tez examples and 
TestAMRecovery run successfully ), but it still needs some code refinement; I 
will attach the patch soon. 
While implementing it, I ran into the following questions and hope to get some 
feedback on them.

* Initialization of the committer still happens on the main AsyncDispatcher 
thread. Is that acceptable? 
* Can TerminateEvent be ignored when the DAG/Vertex is in FINISH_SAVING? IMO it 
can be ignored. If we don't ignore it, we still need to abort the committer. 
Since we have to call commit or abort whether we ignore it or not, I think 
ignoring it is acceptable.
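
The second question can be illustrated with a small transition function. This is a sketch under assumptions, not the real DAG/Vertex state machine: the class TerminateIgnoredSketch and the event names ALL_TASKS_DONE and SAVING_DONE are hypothetical, and the kill path is simplified to go straight from RUNNING to FINISH_SAVING.

```java
// Hypothetical sketch; the real state machines have more states and events.
class TerminateIgnoredSketch {
    enum State { RUNNING, FINISH_SAVING, SUCCEEDED, KILLED }
    enum Event { ALL_TASKS_DONE, TERMINATE, SAVING_DONE }

    private State state = State.RUNNING;
    private State pendingFinalState; // decided when FINISH_SAVING is entered

    // Apply one event and return the resulting state.
    State handle(Event event) {
        switch (state) {
            case RUNNING:
                if (event == Event.ALL_TASKS_DONE) {
                    enterFinishSaving(State.SUCCEEDED);
                } else if (event == Event.TERMINATE) {
                    enterFinishSaving(State.KILLED);
                }
                break;
            case FINISH_SAVING:
                if (event == Event.SAVING_DONE) {
                    state = pendingFinalState;
                }
                // TERMINATE is deliberately ignored here: commit/abort is
                // already in flight and will run to completion either way.
                break;
            default:
                break; // final states ignore every event
        }
        return state;
    }

    private void enterFinishSaving(State finalState) {
        pendingFinalState = finalState;
        state = State.FINISH_SAVING;
    }

    public static void main(String[] args) {
        TerminateIgnoredSketch vertex = new TerminateIgnoredSketch();
        System.out.println(vertex.handle(Event.ALL_TASKS_DONE));
        System.out.println(vertex.handle(Event.TERMINATE));
        System.out.println(vertex.handle(Event.SAVING_DONE));
    }
}
```

With this shape, a TerminateEvent that arrives during FINISH_SAVING does not change the outcome that was decided when the state was entered.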






was (Author: zjffdu):
This jira seems closely related to TEZ-714, so I will try to resolve them 
together.  I created a FinishSavingService for logging critical recovery events 
and for commit/abort.

Attached are the 2 state machine diagrams (DAG/Vertex).  [~bikassaha], [~hitesh], 
please help review the state machines and answer my 2 questions below.
* The main change is one additional state: FINISH_SAVING.  ( 
RUNNING/TERMINATING will transition to FINISH_SAVING first and then go to 
SUCCEEDED/FAILED/KILLED. )
** In the Vertex's FINISH_SAVING, it logs recovery data (VertexFinishedEvent, 
VertexCommitStartedEvent) and commits/aborts the data if necessary. 
** In the DAG's FINISH_SAVING, it logs recovery data (DAGCommitStartedEvent, 
DAGFinishedEvent) and commits/aborts the data if necessary.
** VertexGroupCommitStartedEvent / VertexGroupCommitFinishedEvent will be 
handled in FinishSavingService while the DAG stays in the RUNNING state.
** DAGSubmittedEvent is a special event: it is logged in the RPC call 
submitDAG, so it is not handled in the FinishSavingService.
** During recovery, recovery events are still logged and commits still run on 
the main AsyncDispatcher.  I am not confident enough to move this off the main 
AsyncDispatcher now, so it may be left for another jira. 
 
I have implemented a prototype of this feature ( the tez examples and 
TestAMRecovery run successfully ), but it still needs some code refinement; I 
will attach the patch soon. 
While implementing it, I ran into the following questions and hope to get some 
feedback on them.

* Initialization of the committer still happens on the main AsyncDispatcher 
thread. Is that acceptable? 
* Can TerminateEvent be ignored when the DAG/Vertex is in FINISH_SAVING? IMO it 
can be ignored. If we don't ignore it, we still need to abort the committer. 
Since we have to call commit or abort whether we ignore it or not, I think 
ignoring it is acceptable.





> Recovery data should not be written on AsyncDispatcher thread
> -------------------------------------------------------------
>
>                 Key: TEZ-992
>                 URL: https://issues.apache.org/jira/browse/TEZ-992
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Jeff Zhang
>         Attachments: DAG_FinishSaving.gv, Vertex_FinishSaving.gv
>
>
> This may block the DAG operations in case the recovery data needs to be 
> synchronously stored. The operations requiring this blocking operation should 
> change their state machines to wait for the store operation before moving 
> ahead. They will move ahead after they receive notification from the 
> RecoveryService that their operation has completed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
