[
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338228#comment-14338228
]
Jeff Zhang commented on TEZ-1019:
---------------------------------
[~hitesh] Thanks for the review. I have attached a new patch that addresses
the review comments.
bq. in restoreFromEvent, the code goes through manually defined paths instead
of using existing transition functions resulting in duplication of logic.
This is a limitation of the current recovery process. Currently, recovery
follows the flow below:
DAG::restoreFromEvent -> Vertex::restoreFromEvent -> Task::restoreFromEvent ->
TaskAttempt::restoreFromEvent -> DAG::RecoveryTransition ->
Vertex::RecoveryTransition -> Task::RecoveryTransition ->
TaskAttempt::RecoveryTransition
So Vertex::restoreFromEvent has to call some functions manually to create the
tasks; otherwise Task::restoreFromEvent would throw an NPE because the tasks
have not been created yet.
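A minimal sketch of this ordering problem, using hypothetical stand-in classes
(SketchVertex, SketchTask) rather than the real Tez ones. All method signatures
here are assumptions made for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a task; not the real Tez class.
class SketchTask {
    final String id;
    String restoredState = "NEW";

    SketchTask(String id) { this.id = id; }

    void restoreFromEvent(String state) { this.restoredState = state; }
}

// Hypothetical stand-in for a vertex; not the real Tez class.
class SketchVertex {
    private final Map<String, SketchTask> tasks = new HashMap<>();

    // Restore path: the tasks are created by hand here, because the normal
    // transition that would create them has not run at this point.
    void restoreFromEvent(int numTasks) {
        for (int i = 0; i < numTasks; i++) {
            String id = "task_" + i;
            tasks.put(id, new SketchTask(id));
        }
    }

    // Routes a recovered task event to its task.
    void restoreTask(String taskId, String state) {
        SketchTask task = tasks.get(taskId); // null if the vertex was not restored first
        task.restoreFromEvent(state);        // throws NullPointerException when task is null
    }

    SketchTask getTask(String taskId) { return tasks.get(taskId); }
}
```

Calling restoreTask before restoreFromEvent fails with an NPE in this sketch,
which is why the vertex restore step duplicates the task-creation logic.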
In theory, I think it is possible to fully align the recovery transitions with
the normal transitions, but that requires refactoring the current recovery
process. TEZ-1657 tracks this.
We can first consolidate all the recovery logs into DagRecoveryData and use
this data to recover the DAG. The DAG then follows the normal state machine
transitions; when it needs to recover its vertices, we just extract
VertexRecoveryData from DagRecoveryData and use it to recover the vertices.
The same applies to tasks and task attempts.
DAG::RecoveryTransition -> Vertex::RecoveryTransition ->
Task::RecoveryTransition -> TaskAttempt::RecoveryTransition
But this change is too big, so I think we can handle it in a separate JIRA.
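A rough sketch of the consolidated approach, only to illustrate the idea.
DagRecoveryData and VertexRecoveryData are the names proposed above;
TaskRecoveryData, RecoveryFlowSketch, and every field and method shown are
hypothetical and not part of any patch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-task slice of the recovery data.
class TaskRecoveryData {
    final String taskId;
    final String lastState;

    TaskRecoveryData(String taskId, String lastState) {
        this.taskId = taskId;
        this.lastState = lastState;
    }
}

// Per-vertex slice, extracted from DagRecoveryData (fields are assumptions).
class VertexRecoveryData {
    final String vertexId;
    final Map<String, TaskRecoveryData> tasks = new LinkedHashMap<>();

    VertexRecoveryData(String vertexId) { this.vertexId = vertexId; }
}

// All recovery logs consolidated in one place (fields are assumptions).
class DagRecoveryData {
    final Map<String, VertexRecoveryData> vertices = new LinkedHashMap<>();

    VertexRecoveryData forVertex(String vertexId) { return vertices.get(vertexId); }
}

class RecoveryFlowSketch {
    // Walks DAG -> Vertex -> Task, handing each level only its own slice of
    // the data, and records the order in which the transitions would run.
    static List<String> recover(DagRecoveryData dagData) {
        List<String> trace = new ArrayList<>();
        trace.add("DAG::RecoveryTransition");
        for (VertexRecoveryData vertexData : dagData.vertices.values()) {
            trace.add("Vertex::RecoveryTransition " + vertexData.vertexId);
            for (TaskRecoveryData taskData : vertexData.tasks.values()) {
                trace.add("Task::RecoveryTransition " + taskData.taskId
                        + " -> " + taskData.lastState);
            }
        }
        return trace;
    }
}
```

In this sketch each level receives its recovery data from its parent, so no
level has to rebuild its children by hand before their events are replayed.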
> Re-factor routing of events to use common code path for normal and recovery
> flow.
> ---------------------------------------------------------------------------------
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, TEZ-1019-4.patch,
> TEZ-1019-5.patch, Tez-1019.patch
>
>