[
https://issues.apache.org/jira/browse/TEZ-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129161#comment-14129161
]
Hitesh Shah commented on TEZ-1559:
----------------------------------
Comments:
- in the future, it will be helpful to reviewers if white space cleanup is
done as a separate patch
- Can you explain why TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED is
needed? How does this affect recovery testing? Does setting the flush settings
to 0 i.e. immediate flush help?
{code}
LOG.info("Recovered a successful attempt"
+ ", taskAttemptId=" + task.successfulAttempt.toString());
- task.logJobHistoryTaskFinishedEvent();
+ if (task.recoveredState == TaskState.SUCCEEDED) {
+ task.logJobHistoryTaskFinishedEvent();
+ }
{code}
- the log message says a successful task attempt was recovered but there is
an additional task state check - why is this needed?
- is there a reason why vertex id cannot be used in tests and vertex name is
needed? Addition of vertex name into the proto for recovery events will
increase size of the file - though the increase may not be much but it will be
add 10-20 bytes depending on size of vertex name per entry into the file.
{code}
+ if (handleRemainingEventWhenStopped) {
+ LOG.info("Handle the remaining events in queue");
+ while(!eventQueue.isEmpty()) {
+ synchronized (lock) {
+ try {
+ ++eventsProcessed;
+ DAGHistoryEvent event = eventQueue.take();
+ handleRecoveryEvent(event);
+ } catch (Exception e) {
+ // For now, ignore any such errors as these are non-critical
+ // All summary event related errors are handled as critical
+ LOG.warn("Error handling recovery event", e);
+ }
+ }
+ }
+ }
{code}
- is there a need to track the counter?
- would be useful to log size of queue before processing events
{code}
+
tezConf.set(TezConfiguration.TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED,
"true");
+ tezConf.set(FAIL_ON_ATTEMPT, rand.nextInt(MAX_AM_ATTEMPT) + "");
+ tezConf.set(FAIL_ON_PARTIAL_FINISHED, "true");
+ tezConf.set(FAIL_ON_PARTIAL_FINISHED, "false");
{code}
- use conf.setBoolean or the appropriate such as setInt as needed. Likewise
in the get calls.
Is this patch meant to apply directly on trunk ? or after TEZ-850?
> Add system tests for AM recovery
> --------------------------------
>
> Key: TEZ-1559
> URL: https://issues.apache.org/jira/browse/TEZ-1559
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: Tez-1559.patch
>
>
> * [Fine-grained recovery task-level] In a vertex, task 0 is done task 1 is
> running. History flush happens. AM dies. Once AM is recovered, task 0 is not
> re-run. Task 1 is re-run.
> * [Data movement types] Test AM recovery with all data movement types
> including 1-1, broadcast, scatter-gather with/without shuffle. AM should die
> in 2 scenarios: first-vertex task finishes completely and partially.
> * [Kill AM many times] Set AM max attempt to high number. Kill many attempts.
> Last AM can still be recovered with latest AM history data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)