[
https://issues.apache.org/jira/browse/TEZ-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219187#comment-14219187
]
Jeff Zhang edited comment on TEZ-992 at 11/20/14 10:56 AM:
-----------------------------------------------------------
This jira seems closely related to TEZ-714, so I am trying to resolve them
together. I created a FinishSavingService for logging critical recovery events
and for commit/abort.
I have attached the two state machine diagrams (DAG/Vertex). [~bikassaha],
[~hitesh], please help review the state machines and answer my two questions
below.
* The main change is that I add one additional state: FINISH_SAVING. (Before
going to SUCCEEDED/FAILED/KILLED, the DAG/Vertex goes to FINISH_SAVING first.)
** In Vertex's FINISH_SAVING, it logs recovery data (VertexFinishedEvent,
VertexCommitStartedEvent) and commits/aborts the data if necessary.
** In DAG's FINISH_SAVING, it logs recovery data (DAGCommitStartedEvent,
DAGFinishedEvent) and commits/aborts the data if necessary.
** For VertexGroupCommitStartedEvent / VertexGroupCommitFinishedEvent, I will
run them in FinishSavingService and keep the DAG in the RUNNING state.
** DAGSubmittedEvent is a special event: it is logged in the RPC call
submitDAG, so it is not handled in the FinishSavingService.
** In recovery, logging recovery events and committing still happen in the main
AsyncDispatcher. I am not confident about moving this out of the main
AsyncDispatcher now, so it may be left to another jira.
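The proposed FINISH_SAVING state can be illustrated with a minimal,
self-contained sketch. This is not the Tez implementation: the class name, the
State enum, and the finish()/savingCompleted() methods are hypothetical, and
the real DAG/Vertex state machines have more states and events.

```java
import java.util.EnumSet;

public class FinishSavingSketch {
    // Hypothetical, simplified states; not the real Tez DAG/Vertex enums.
    public enum State { RUNNING, TERMINATING, FINISH_SAVING, SUCCEEDED, FAILED, KILLED }

    private static final EnumSet<State> TERMINAL =
            EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

    private State state = State.RUNNING;
    private State pendingFinalState;

    /** Work is done: park in FINISH_SAVING instead of going terminal directly. */
    public void finish(State intendedFinalState) {
        if (!TERMINAL.contains(intendedFinalState)) {
            throw new IllegalArgumentException("not a terminal state: " + intendedFinalState);
        }
        if (TERMINAL.contains(state) || state == State.FINISH_SAVING) {
            throw new IllegalStateException("already finishing or finished: " + state);
        }
        pendingFinalState = intendedFinalState;
        state = State.FINISH_SAVING;
        // Assumption: this is where the recovery log write and commit/abort
        // would be handed off to a separate service.
    }

    /** The saving work reported completion: move to the remembered final state. */
    public void savingCompleted() {
        if (state != State.FINISH_SAVING) {
            throw new IllegalStateException("not in FINISH_SAVING: " + state);
        }
        state = pendingFinalState;
    }

    public State getState() {
        return state;
    }
}
```

The point of the extra state is that the final state is only remembered while
saving is in progress; it becomes visible once the save has completed.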
I have implemented a prototype of this feature (the tez examples and
TestAMRecovery run successfully), but it still needs some code refinement. I
will attach the patch soon.
While implementing it, I ran into the following questions and hope to get some
feedback on them.
* Initialization of the committer is still in the main AsyncDispatcher thread.
Is that acceptable?
* Can TerminateEvent be ignored when DAG/Vertex is in FINISH_SAVING? IMO it can
be ignored. If we don't ignore it, we still need to abort the committer. Since
we have to call commit or abort whether we ignore it or not, I think ignoring
it is acceptable.
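The second question can be made concrete with a small transition function in
which a terminate request arriving in FINISH_SAVING leaves the state unchanged.
This is only a sketch of the option being proposed: the State and Event enums
are hypothetical, and FINISHED stands in for SUCCEEDED/FAILED/KILLED.

```java
public class TerminateInFinishSaving {
    // Hypothetical, simplified states and events (not the real Tez enums).
    public enum State { RUNNING, TERMINATING, FINISH_SAVING, FINISHED }
    public enum Event { TERMINATE, SAVING_DONE }

    /** Returns the next state; events a state does not react to leave it unchanged. */
    public static State next(State current, Event event) {
        switch (current) {
            case RUNNING:
                return event == Event.TERMINATE ? State.TERMINATING : current;
            case FINISH_SAVING:
                // TERMINATE is ignored here: commit or abort is already under way,
                // so only the completion of saving moves the state forward.
                return event == Event.SAVING_DONE ? State.FINISHED : current;
            default:
                return current;
        }
    }
}
```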
was (Author: zjffdu):
This jira seems closely related to TEZ-714, so I am trying to resolve them
together. I created a FinishSavingService for logging critical recovery events
and for commit/abort.
I have attached the two state machine diagrams (DAG/Vertex). [~bikassaha],
[~hitesh], please help review the state machines and answer my two questions
below.
* The main change is that I add one additional state: FINISH_SAVING.
(RUNNING/TERMINATING will transition to FINISH_SAVING first and then go to
SUCCEEDED/FAILED/KILLED.)
** In Vertex's FINISH_SAVING, it logs recovery data (VertexFinishedEvent,
VertexCommitStartedEvent) and commits/aborts the data if necessary.
** In DAG's FINISH_SAVING, it logs recovery data (DAGCommitStartedEvent,
DAGFinishedEvent) and commits/aborts the data if necessary.
** For VertexGroupCommitStartedEvent / VertexGroupCommitFinishedEvent, I will
run them in FinishSavingService and keep the DAG in the RUNNING state.
** DAGSubmittedEvent is a special event: it is logged in the RPC call
submitDAG, so it is not handled in the FinishSavingService.
** In recovery, logging recovery events and committing still happen in the main
AsyncDispatcher. I am not confident about moving this out of the main
AsyncDispatcher now, so it may be left to another jira.
I have implemented a prototype of this feature (the tez examples and
TestAMRecovery run successfully), but it still needs some code refinement. I
will attach the patch soon.
While implementing it, I ran into the following questions and hope to get some
feedback on them.
* Initialization of the committer is still in the main AsyncDispatcher thread.
Is that acceptable?
* Can TerminateEvent be ignored when DAG/Vertex is in FINISH_SAVING? IMO it can
be ignored. If we don't ignore it, we still need to abort the committer. Since
we have to call commit or abort whether we ignore it or not, I think ignoring
it is acceptable.
> Recovery data should not be written on AsyncDispatcher thread
> -------------------------------------------------------------
>
> Key: TEZ-992
> URL: https://issues.apache.org/jira/browse/TEZ-992
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Jeff Zhang
> Attachments: DAG_FinishSaving.gv, Vertex_FinishSaving.gv
>
>
> This may block the DAG operations in case the recovery data needs to be
> synchronously stored. The operations requiring this blocking operation should
> change their state machines to wait for the store operation before moving
> ahead. They will move ahead after they receive notification from the
> RecoveryService that their operation has completed.
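The pattern described in the issue (hand the blocking store to another thread,
then continue once a completion notification arrives) can be sketched as
follows. This is an illustration under assumptions, not the Tez
RecoveryService API: the class, its methods, the string-based notifications,
and the queue standing in for the dispatcher are all hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class AsyncRecoverySketch {
    // Hypothetical stand-in for the dispatcher's event queue.
    private final BlockingQueue<String> dispatcherQueue = new LinkedBlockingQueue<>();
    // A single worker thread keeps recovery writes in submission order.
    private final ExecutorService saver = Executors.newSingleThreadExecutor();

    /** Called from the dispatcher thread; returns without waiting for the write. */
    public void saveAsync(String recoveryEvent, Runnable blockingWrite) {
        saver.submit(() -> {
            try {
                blockingWrite.run(); // potentially slow synchronous store
                dispatcherQueue.add("SAVED:" + recoveryEvent);
            } catch (RuntimeException e) {
                dispatcherQueue.add("SAVE_FAILED:" + recoveryEvent);
            }
        });
    }

    /** Waits up to timeoutMs for the next notification; returns null on timeout. */
    public String awaitNotification(long timeoutMs) {
        try {
            return dispatcherQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public void shutdown() {
        saver.shutdown();
    }
}
```

In a real state machine the notification would arrive as an event handled by
the DAG or Vertex waiting on the store, rather than being polled.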