[ 
https://issues.apache.org/jira/browse/TEZ-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129161#comment-14129161
 ] 

Hitesh Shah commented on TEZ-1559:
----------------------------------

Comments:
   - in the future, it will be helpful to reviewers if white space cleanup is 
done as a separate patch 
   - Can you explain why TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED is 
needed? How does this affect recovery testing? Does setting the flush settings 
to 0 i.e. immediate flush help?

{code}
               LOG.info("Recovered a successful attempt"
                   + ", taskAttemptId=" + task.successfulAttempt.toString());
-              task.logJobHistoryTaskFinishedEvent();
+              if (task.recoveredState == TaskState.SUCCEEDED) {
+                task.logJobHistoryTaskFinishedEvent();
+              }
{code}
  - the log message says a successful task attempt was recovered but there is 
an additional task state check - why is this needed? 

  - is there a reason why vertex id cannot be used in tests and vertex name is 
needed? Addition of vertex name into the proto for recovery events will 
increase size of the file - though the increase may not be much but it will be 
add 10-20 bytes depending on size of vertex name per entry into the file.

{code}
+    if (handleRemainingEventWhenStopped) {
+      LOG.info("Handle the remaining events in queue");
+      while(!eventQueue.isEmpty()) {
+        synchronized (lock) {
+          try {
+            ++eventsProcessed;
+            DAGHistoryEvent event = eventQueue.take();
+            handleRecoveryEvent(event);
+          } catch (Exception e) {
+            // For now, ignore any such errors as these are non-critical
+            // All summary event related errors are handled as critical
+            LOG.warn("Error handling recovery event", e);
+          }
+        }
+      }
+    }
{code}
  - is there a need to track the counter? 
  - would be useful to log size of queue before processing events

{code}
+    
tezConf.set(TezConfiguration.TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED,
 "true");
+    tezConf.set(FAIL_ON_ATTEMPT, rand.nextInt(MAX_AM_ATTEMPT) + "");
+      tezConf.set(FAIL_ON_PARTIAL_FINISHED, "true");
+      tezConf.set(FAIL_ON_PARTIAL_FINISHED, "false");
{code}
  - use conf.setBoolean or the appropriate such as setInt as needed. Likewise 
in the get calls. 

Is this patch meant to apply directly on trunk ? or after TEZ-850? 


> Add system tests for AM recovery
> --------------------------------
>
>                 Key: TEZ-1559
>                 URL: https://issues.apache.org/jira/browse/TEZ-1559
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: Tez-1559.patch
>
>
> * [Fine-grained recovery task-level] In a vertex, task 0 is done task 1 is 
> running. History flush happens. AM dies. Once AM is recovered, task 0 is not 
> re-run. Task 1 is re-run.
> * [Data movement types] Test AM recovery with all data movement types 
> including 1-1, broadcast, scatter-gather with/without shuffle. AM should die 
> in 2 scenarios: first-vertex task finishes completely and partially.
> * [Kill AM many times] Set AM max attempt to high number. Kill many attempts. 
> Last AM can still be recovered with latest AM history data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to