[ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976360#comment-14976360
 ] 

Bikas Saha commented on TEZ-2581:
---------------------------------

tez-dag/src/main/java/org/apache/tez/dag/app/RecoveryParser.java
{code}
    public void isRecoverable() {
      // DAG is not recoverable if vertex has committed and has completed the commit
      // but its full recovery events are not seen.
      for (Map.Entry<TezVertexID, VertexRecoveryData> entry : vertexRecoveryData.entrySet()) {
        TezVertexID vertexId = entry.getKey();
        VertexRecoveryData vertexData = entry.getValue();
        if (vertexData.getVertexFinishedEvent() != null) {
          // TODO only see full task events should be OK
          if (vertexCommitStatus.containsKey(vertexId) && vertexCommitStatus.get(vertexId)) {
            this.nonRecoverable = true;
            this.reason = "Vertex has been committed, but its full recovery events are not seen, vertexId=" + vertexId;
            return;
          }
{code}
I didn't quite understand this. What is the reasoning behind it? Also, does
"vertexData.getVertexFinishedEvent() != null" imply that the full recovery
events have not been seen? Should the check be for null instead (based on the
code in isVertexFinished())?
Can we put the ".containsKey + .get" check into a method similar to
isVertexGroupCommitted()?
Why are we looking at vertex group commit members separately? A vertex group
commit is a single operation that commits for all member vertices. Each member
vertex does not have a separate commit operation.
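
To illustrate the helper-method suggestion, here is a rough sketch of what I have in mind. It is not taken from the patch: the class name VertexCommitStatusSketch, the method names recordCommit and isVertexCommitted, and the use of String in place of TezVertexID are all assumptions made so the snippet compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parser state; vertex IDs are plain strings
// here instead of TezVertexID so the sketch is self-contained.
final class VertexCommitStatusSketch {
  private final Map<String, Boolean> vertexCommitStatus = new HashMap<String, Boolean>();

  // Records whether the commit of the given vertex has completed.
  public void recordCommit(String vertexId, boolean committed) {
    vertexCommitStatus.put(vertexId, committed);
  }

  // One null-safe lookup in place of the containsKey(...) && get(...) pair.
  // Returns false for unknown vertices and for entries mapped to false or null.
  public boolean isVertexCommitted(String vertexId) {
    return Boolean.TRUE.equals(vertexCommitStatus.get(vertexId));
  }
}
```

The caller would then read "if (isVertexCommitted(vertexId))", which mirrors how isVertexGroupCommitted() is used.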

{code}
      LOG.info("Read HistoryEvent, eventType=" + historyEvent.getEventType() + ", event=" + historyEvent);
      LOG.info("attemptRecoveryPath:" + attemptPath);
      LOG.info("summaryPath:" + summaryFile);
      LOG.info("SummaryFile size:" + summaryFiles.size());
      // ... and other new logs
{code}
Should these be logged at debug level, or removed?

{code}
          new Path(currentAttemptRecoveryDataDir, appId.toString().replace("application", "dag")
              + "_1" + TezConstants.DAG_RECOVERY_RECOVER_FILE_SUFFIX);
{code}
Why is it always _1?

Typos: vertexGroupMemeberCommitStatus, VERTEX_INIT_GENREATED_EVENTS (should be
vertexGroupMemberCommitStatus and VERTEX_INIT_GENERATED_EVENTS).

How is DagRecoveryData.isRecoverable() different from isDAGRecoverable()?

Can this be put into a common method? Same for the task and attempt lookups.
{code}
            VertexRecoveryData vertexRecoveryData = recoveredDAGData.vertexRecoveryData.get(initGeneratedEvent.getVertexID());
            if (vertexRecoveryData == null) {
              vertexRecoveryData = new VertexRecoveryData();
              recoveredDAGData.vertexRecoveryData.put(initGeneratedEvent.getVertexID(), vertexRecoveryData);
            }
{code}
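
A minimal sketch of the kind of common method I mean is below. The names RecoveryDataLookupSketch and getOrCreateVertexRecoveryData are hypothetical, VertexRecoveryData is reduced to an empty placeholder, and String stands in for TezVertexID, so treat this as an illustration rather than the actual change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the recovered DAG state; String replaces TezVertexID.
final class RecoveryDataLookupSketch {
  // Empty placeholder for the real VertexRecoveryData class.
  static final class VertexRecoveryData {}

  private final Map<String, VertexRecoveryData> vertexRecoveryData =
      new HashMap<String, VertexRecoveryData>();

  // Returns the existing entry for vertexId, creating and storing one if absent.
  public VertexRecoveryData getOrCreateVertexRecoveryData(String vertexId) {
    VertexRecoveryData data = vertexRecoveryData.get(vertexId);
    if (data == null) {
      data = new VertexRecoveryData();
      vertexRecoveryData.put(vertexId, data);
    }
    return data;
  }
}
```

Each call site then shrinks to a single line, and the same pattern would apply to the task and attempt maps. On Java 8 or later, Map.computeIfAbsent does the same thing without a hand-written helper.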

taGeneratedEvents should be in TaskAttemptRecoveryData, right? There could be 
multiple attempts for a task and each could generate events.
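
To make the per-attempt point concrete, here is a small sketch of the layout I am suggesting. Only the name TaskAttemptRecoveryData comes from the patch; the fields, the methods, the use of an int attempt number as the key, and String as the event type are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TaskRecoverySketch {
  // Each attempt owns the events it generated (String stands in for the event type).
  static final class TaskAttemptRecoveryData {
    private final List<String> taGeneratedEvents = new ArrayList<String>();

    public void addGeneratedEvent(String event) {
      taGeneratedEvents.add(event);
    }

    public List<String> getGeneratedEvents() {
      return taGeneratedEvents;
    }
  }

  // The task only tracks its attempts; it holds no generated events itself.
  static final class TaskRecoveryData {
    private final Map<Integer, TaskAttemptRecoveryData> attempts =
        new LinkedHashMap<Integer, TaskAttemptRecoveryData>();

    // Returns the data for the given attempt number, creating it if absent.
    public TaskAttemptRecoveryData attempt(int attemptNumber) {
      TaskAttemptRecoveryData data = attempts.get(attemptNumber);
      if (data == null) {
        data = new TaskAttemptRecoveryData();
        attempts.put(attemptNumber, data);
      }
      return data;
    }
  }
}
```

With this layout, events from a failed attempt stay separate from those of a later successful attempt, so recovery can decide per attempt which events to replay.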


> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
