[
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375512#comment-14375512
]
Jeff Zhang edited comment on TEZ-714 at 3/23/15 8:04 AM:
---------------------------------------------------------
Uploaded a new patch. [~bikassaha], please help review it.
* Wrap the commit in a CallableEvent in both DAG & Vertex, but still call the
abort inline. Making the abort async would complicate the patch, so it stays a
sync call as before.
* Introduce a new state, COMMITTING, for Vertex & DAG
** Vertex's COMMITTING means the vertex is in the middle of committing. If the
vertex has no committers or TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true,
the vertex does not go to the COMMITTING state.
** DAG's COMMITTING covers 2 cases: one is when
TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true and all the vertices are
completed; the other is when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is
false and all the vertices are completed, but some vertex group
committers are still running.
* Regarding the issue of "not sure why group-commit and non-group commit need
to be differentiated in different transitions.", I renamed them to
NonFinalCommitCompletedTransition and FinalCommitCompletetionTransition (maybe
there are better names). The former handles committers when
TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and the latter when
TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true. The reason I differentiate
them is that for the NonFinalCommitCompletedEvent we need to write a
VertexGroupCommitCompletedEvent to the recovery log, while that is not
necessary for FinalCommitCompletedEvent.
* The unit tests are still not perfect. Currently in DAGImpl/VertexImpl we run
the shared thread pool in the AsyncDispatcher thread (that means the
committers still run in the AsyncDispatcher thread), so this may hide some
potential issues, and in this threading mode it is not possible to test some
cases, like killing the DAG while it is committing. I am trying to think of
ways to simulate the shared thread pool in the unit tests.
* For some existing transitions, like RUNNING to ERROR due to INTERNAL ERROR,
I am not sure why they go to ERROR directly rather than TERMINATING. Maybe it
is to allow the client to get the final status as early as possible.
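To make the COMMITTING rules above concrete, here is a minimal sketch of the two decisions. It is not code from the patch: the class and method names are hypothetical, and the boolean parameter only stands in for the value of TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS. Both methods assume they are called once all tasks (for a vertex) or all vertices (for a DAG) have completed.

```java
final class CommitStateSketch {
    private CommitStateSketch() {}

    /**
     * Vertex rule: skip COMMITTING when there are no committers or when
     * outputs are committed at DAG success instead.
     */
    static boolean vertexEntersCommitting(int committerCount,
                                          boolean commitAllOutputsOnDagSuccess) {
        return committerCount > 0 && !commitAllOutputsOnDagSuccess;
    }

    /**
     * DAG rule: enter COMMITTING when outputs are committed at DAG success,
     * or when some vertex group committers are still running.
     * Assumption: the comment does not describe a DAG with the flag set and
     * no committers at all, so this sketch does not special-case it.
     */
    static boolean dagEntersCommitting(boolean commitAllOutputsOnDagSuccess,
                                       int runningGroupCommitters) {
        return commitAllOutputsOnDagSuccess || runningGroupCommitters > 0;
    }
}
```

For example, a vertex with two committers enters COMMITTING only when the flag is false, while a DAG with the flag false enters COMMITTING only while group committers are still outstanding.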
> OutputCommitters should not run in the main AM dispatcher thread
> ----------------------------------------------------------------
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there are multiple OutputCommitters on a Vertex, they can be run in
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.
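The idea in the description can be sketched with plain java.util.concurrent: each committer is handed to a shared pool, and its outcome comes back as an event on a queue, so the thread that handles events never blocks inside a commit. This is only an illustration under stated assumptions. Committer, CommitResult, submitCommits and awaitAll are hypothetical names, not Tez classes or the code in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class ParallelCommitSketch {
    /** Hypothetical stand-in for an output committer; not the Tez interface. */
    interface Committer {
        void commit() throws Exception;
    }

    /** Hypothetical completion event handed back to the event-handling thread. */
    static final class CommitResult {
        final String outputName;
        final boolean succeeded;

        CommitResult(String outputName, boolean succeeded) {
            this.outputName = outputName;
            this.succeeded = succeeded;
        }
    }

    /** Submits every committer to the pool and returns without waiting. */
    static void submitCommits(Map<String, Committer> committers,
                              ExecutorService pool,
                              BlockingQueue<CommitResult> events) {
        for (Map.Entry<String, Committer> e : committers.entrySet()) {
            pool.execute(() -> {
                boolean ok = true;
                try {
                    e.getValue().commit();
                } catch (Exception ex) {
                    ok = false; // a failed commit is reported, not rethrown
                }
                events.add(new CommitResult(e.getKey(), ok));
            });
        }
    }

    /** Event side: collects the expected results; true only if all succeeded. */
    static boolean awaitAll(int expected, BlockingQueue<CommitResult> events)
            throws InterruptedException {
        boolean allOk = true;
        for (int i = 0; i < expected; i++) {
            CommitResult r = events.poll(10, TimeUnit.SECONDS);
            if (r == null || !r.succeeded) {
                allOk = false;
            }
        }
        return allOk;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Committer> committers = new LinkedHashMap<>();
        committers.put("out1", () -> {});
        committers.put("out2", () -> {});
        ExecutorService pool = Executors.newFixedThreadPool(2);
        BlockingQueue<CommitResult> events = new LinkedBlockingQueue<>();
        submitCommits(committers, pool, events);
        System.out.println("all committed: " + awaitAll(committers.size(), events));
        pool.shutdown();
    }
}
```

In a real dispatcher the results would be handled one event at a time as they arrive rather than awaited in a loop; awaitAll exists here only to keep the sketch self-contained.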
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)