[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967361#comment-14967361
]
Jason Lowe commented on TEZ-2581:
---------------------------------
I noticed the document doesn't say much about how user-provided code in the
Tez AM, such as the vertex manager plugins, will be involved in recovery. It
mentions a DummyVertexManager, but that isn't defined. It seems the user
code would need to participate in some way, or else we're changing the
semantics of the plugin API. A brief look at the latest patch indicates that,
once the vertex has been configured, a recovery will preclude the
user-provided vertex manager plugin from being used. This may violate
assumptions made by the user and the plugin code, and the violation would only
show up in the rare case of recovery: normal operation would never see the
missing plugin invocations or the lost vertex state updates. It would be
friendlier to involve the vertex manager in the recovery process so that it
can continue to be notified as the DAG progresses. At worst, we can implement
an isRecoverySupported callback or something similar that indicates whether
the user-provided code can help. We do the same for output committers.
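
To make the isRecoverySupported idea concrete, here is a minimal sketch of what
such an opt-in callback could look like. Everything in it is an assumption for
illustration: RecoveryAwarePlugin, PluginRecoveryPolicy and the Action values
are hypothetical names, not part of the Tez plugin API or the patch, and what
the AM should actually do when a plugin cannot help is left open here.

```java
// Hypothetical opt-in hook; none of these names come from the Tez API.
interface RecoveryAwarePlugin {
    // Conservative default: a plugin that does not override this is
    // treated as unable to take part in recovery.
    default boolean isRecoverySupported() {
        return false;
    }
}

final class PluginRecoveryPolicy {
    // Placeholder outcomes; the real handling of each case is a design decision.
    enum Action { REINSTATE_PLUGIN, PLUGIN_CANNOT_HELP }

    private PluginRecoveryPolicy() {}

    // Ask the user-provided code whether it can participate in recovery.
    static Action decide(RecoveryAwarePlugin plugin) {
        return plugin.isRecoverySupported()
                ? Action.REINSTATE_PLUGIN
                : Action.PLUGIN_CANNOT_HELP;
    }
}
```

Defaulting to false keeps existing plugins unchanged and makes participation in
recovery an explicit choice by the plugin author.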

The key in this process is that we can never allow any vertex to reconfigure
parallelism after any task has started. Therefore, if we discover that the
committer doesn't support data recovery, for example, we can't just assume we
can start the vertex over from scratch. TEZ-2589's description implies there
would be scenarios where we would recover a vertex from scratch, and doing so
after any task has started in the previous DAG attempt can lead to data loss or
duplication.
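
The invariant above can be sketched as a small guard. This is only an
illustration under stated assumptions: VertexRecoveryState and its methods are
hypothetical names, not classes from Tez or the patch, and the
"any task started" flag is assumed to be recoverable from the previous DAG
attempt's history.

```java
// Hypothetical model of a recovered vertex; not a Tez class.
final class VertexRecoveryState {
    private int parallelism;
    private boolean anyTaskStarted;

    // anyTaskStartedPreviously is assumed to come from the previous
    // DAG attempt's recovery log.
    VertexRecoveryState(int parallelism, boolean anyTaskStartedPreviously) {
        this.parallelism = parallelism;
        this.anyTaskStarted = anyTaskStartedPreviously;
    }

    void markTaskStarted() {
        anyTaskStarted = true;
    }

    int getParallelism() {
        return parallelism;
    }

    // Restarting from scratch is only safe if no task ever started,
    // in this attempt or in a previous one.
    boolean canRestartFromScratch() {
        return !anyTaskStarted;
    }

    // Reject reconfiguration once any task has started, since doing so
    // can lead to data loss or duplication.
    void reconfigureParallelism(int newParallelism) {
        if (anyTaskStarted) {
            throw new IllegalStateException(
                    "cannot reconfigure parallelism after a task has started");
        }
        parallelism = newParallelism;
    }
}
```

Failing loudly when the invariant is violated, rather than silently
re-running the vertex, keeps the problem visible in the rare recovery path.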
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch,
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch,
> TEZ-2581-WIP-6.patch, TezRecoveryRedesignProposal.pdf,
> TezRecoveryRedesignV1.1.pdf
>
>