[
https://issues.apache.org/jira/browse/TEZ-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129530#comment-14129530
]
Hitesh Shah commented on TEZ-1559:
----------------------------------
bq. VertexId is generated internally in tez which is not visible for users. so
I think use vertex name is proper although it will add some cost on the
recovery log.
- Recovery data is meant to be internal and not exposed to users in any
case.
bq. is there a need to track the counter?
- Sorry, I was not clear. I meant the counter in the serviceStop() loop that
tracks the number of events processed.
bq. TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED
- If these are internal only and used for testing, it might be better to put
both of them in RecoveryService rather than in TezConfiguration, which is
public-facing.
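A rough sketch of what that could look like, not taken from the patch: the
class below is a simplified, hypothetical stand-in for RecoveryService, a plain
Map stands in for the Hadoop Configuration, and the key string and default
value are made up for illustration. Only one of the two keys is shown. The
point is just that the test-only constant lives next to the code that reads it
instead of in the public TezConfiguration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified stand-in for RecoveryService (not the real class).
final class RecoveryServiceSketch {
    // Internal, test-only key declared here rather than in TezConfiguration.
    // The key string and the default below are assumptions for illustration.
    static final String HANDLE_REMAINING_EVENT_WHEN_STOPPED =
        "tez.test.recovery.handle-remaining-event-when-stopped";
    static final boolean HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT = true;

    private final Queue<String> pendingEvents = new ArrayDeque<>();
    private final List<String> flushedEvents = new ArrayList<>();
    private final boolean handleRemainingWhenStopped;

    // A Map stands in for the Hadoop Configuration object in this sketch.
    RecoveryServiceSketch(Map<String, String> conf) {
        String value = conf.get(HANDLE_REMAINING_EVENT_WHEN_STOPPED);
        this.handleRemainingWhenStopped = (value == null)
            ? HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT
            : Boolean.parseBoolean(value);
    }

    void enqueue(String event) {
        pendingEvents.add(event);
    }

    // On stop, either drain whatever is still queued or drop it,
    // depending on the internal flag.
    void serviceStop() {
        if (handleRemainingWhenStopped) {
            while (!pendingEvents.isEmpty()) {
                flushedEvents.add(pendingEvents.poll());
            }
        } else {
            pendingEvents.clear();
        }
    }

    List<String> flushedEvents() {
        return flushedEvents;
    }
}
```

A test can then set RecoveryServiceSketch.HANDLE_REMAINING_EVENT_WHEN_STOPPED
to "false" to simulate the AM stopping with unflushed events, without the key
ever showing up in user-facing configuration.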
> Add system tests for AM recovery
> --------------------------------
>
> Key: TEZ-1559
> URL: https://issues.apache.org/jira/browse/TEZ-1559
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: Tez-1559-2.patch, Tez-1559.patch
>
>
> * [Fine-grained recovery task-level] In a vertex, task 0 is done, task 1 is
> running. History flush happens. AM dies. Once AM is recovered, task 0 is not
> re-run. Task 1 is re-run.
> * [Data movement types] Test AM recovery with all data movement types
> including 1-1, broadcast, scatter-gather with/without shuffle. AM should die
> in 2 scenarios: first-vertex task finishes completely and partially.
> * [Kill AM many times] Set AM max attempts to a high number. Kill many
> attempts. Last AM can still be recovered with the latest AM history data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)