[jira] [Commented] (TEZ-1559) Add system tests for AM recovery

Jeff Zhang (JIRA) Wed, 10 Sep 2014 17:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129409#comment-14129409
 ]


Jeff Zhang commented on TEZ-1559:
---------------------------------

bq. in the future, it will be helpful to reviewers if white space cleanup is 
done as a separate patch
Will do (I will disable the save-action setting in Eclipse).

bq. Can you explain why TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED is 
needed? How does this affect recovery testing? Does setting the flush settings 
to 0 i.e. immediate flush help?

Setting the flush interval to 0 only causes SummaryEvents to be written 
immediately. Non-summary events are written by a separate thread, so there is 
no guarantee that they have reached HDFS when the AM goes down. That is why I 
need TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED: it writes the events 
remaining in the queue when the service stops. I also found similar code with 
the same purpose in ATSHistoryLoggingService (serviceStop):
{code}
synchronized (lock) {
  if (!eventQueue.isEmpty()) {
    LOG.warn("ATSService being stopped"
        + ", eventQueueBacklog=" + eventQueue.size()
        + ", maxTimeLeftToFlush=" + maxTimeToWaitOnShutdown);
    long startTime = appContext.getClock().getTime();
    if (maxTimeToWaitOnShutdown > 0) {
      long endTime = startTime + maxTimeToWaitOnShutdown;
      // Drain the backlog until the queue is empty or the deadline passes.
      while (endTime >= appContext.getClock().getTime()) {
        DAGHistoryEvent event = eventQueue.poll();
        if (event == null) {
          break;
        }
        try {
          handleEvent(event);
        } catch (Exception e) {
          // Stop draining on the first failure.
          LOG.warn("Error handling event", e);
          break;
        }
      }
    }
  }
}
{code}
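
The same drain-on-stop idea can be sketched in isolation. This is only a minimal illustration, not the code in the patch: the class name DrainOnStop, the drain method and its parameters are hypothetical, and a plain Consumer stands in for the real event handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical helper, not part of Tez: illustrates a bounded drain on stop.
final class DrainOnStop {

    /**
     * Passes queued events to the handler until the queue is empty, the
     * handler throws, or maxWaitMillis has elapsed.
     * Returns the number of events handled successfully.
     */
    static <E> int drain(LinkedBlockingQueue<E> queue, Consumer<E> handler, long maxWaitMillis) {
        int handled = 0;
        if (maxWaitMillis <= 0) {
            return handled; // no time budget: leave the backlog untouched
        }
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (System.currentTimeMillis() <= deadline) {
            E event = queue.poll(); // non-blocking; null means the queue is empty
            if (event == null) {
                break;
            }
            try {
                handler.accept(event);
                handled++;
            } catch (RuntimeException e) {
                break; // stop on the first failure, as in the snippet above
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("event-1");
        queue.add("event-2");
        List<String> written = new ArrayList<>();
        int handled = drain(queue, written::add, 1000L);
        System.out.println("handled=" + handled + ", backlog=" + queue.size());
    }
}
```

The time budget bounds how long shutdown can be delayed; events still in the queue when it expires are not written.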

bq. is there a reason why vertex id cannot be used in tests and vertex name is 
needed. Addition of vertex name into the proto for recovery events will 
increase size of the file - though the increase may not be much, it will 
add 10-20 bytes depending on size of vertex name per entry into the file.

VertexId is generated internally by Tez and is not visible to users, so I 
think using the vertex name is appropriate, although it adds some cost to the 
recovery log.
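
As a rough check on that cost, the encoded size of one protobuf string field is the tag, plus a varint length, plus the UTF-8 bytes of the value. The sketch below only does that arithmetic; the class name, the field number and the sample vertex name are hypothetical and are not taken from the actual Tez proto definitions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: estimates the size of one protobuf string field.
final class VertexNameCost {

    // Number of bytes a protobuf varint needs for a non-negative value.
    static int varintSize(int value) {
        int size = 1;
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    // Size of a length-delimited string field: tag + length varint + UTF-8 payload.
    static int stringFieldSize(int fieldNumber, String value) {
        int payload = value.getBytes(StandardCharsets.UTF_8).length;
        int tag = (fieldNumber << 3) | 2; // wire type 2 = length-delimited
        return varintSize(tag) + varintSize(payload) + payload;
    }

    public static void main(String[] args) {
        // Field number 5 and the name "vertex1" are made-up examples.
        System.out.println(stringFieldSize(5, "vertex1") + " bytes");
    }
}
```

For short ASCII names the overhead is two bytes on top of the name itself, so the per-entry cost is essentially the length of the vertex name.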

bq. is there a need to track the counter?
There is a separate JIRA for counters, 
[TEZ-853|https://issues.apache.org/jira/browse/TEZ-853], but I think adding it 
here could strengthen the test (I will add it in the next patch).

bq. would be useful to log size of queue before processing events
bq. use conf.setBoolean or the appropriate such as setInt as needed. Likewise 
in the get calls.
Will do it in the next patch.

bq. Is this patch meant to apply directly on trunk ? or after TEZ-850?
It can be applied directly on trunk; it does not depend on 
[TEZ-850|https://issues.apache.org/jira/browse/TEZ-850].

> Add system tests for AM recovery
> --------------------------------
>
>                 Key: TEZ-1559
>                 URL: https://issues.apache.org/jira/browse/TEZ-1559
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: Tez-1559.patch
>
>
> * [Fine-grained recovery task-level] In a vertex, task 0 is done task 1 is 
> running. History flush happens. AM dies. Once AM is recovered, task 0 is not 
> re-run. Task 1 is re-run.
> * [Data movement types] Test AM recovery with all data movement types 
> including 1-1, broadcast, scatter-gather with/without shuffle. AM should die 
> in 2 scenarios: first-vertex task finishes completely and partially.
> * [Kill AM many times] Set AM max attempt to high number. Kill many attempts. 
> Last AM can still be recovered with latest AM history data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
