[ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992346#comment-14992346
 ] 

Bikas Saha commented on TEZ-2581:
---------------------------------

DAGImpl.recoverTransition
Can we add some diagnostics saying DAG recovery failed. AND/OR DAG recovery 
succeeded but status information may be partial.

TaskAttemptFinishedEvent.java/VertexInitGeneratedEvent.java
We could do builder.addTaGeneratedEvents() directly in the loop instead of 
creating another list.

TezExampleBase.java
Methods made public for testing should have an @Private and @VisibleForTesting 
annotation

DAGImpl.java
bq. Only the dag init event is synchronously handled (just for generating the 
vertices). VertexEventRecover is handled asynchronously
I am wary that this may break after a future change made to the normal INIT 
transition. Init transition has other logic for the normal code path whereas in 
recoverTransition we only want to set the state to a known state for clients. 
IMO, we could put the vertex creation code in a common method and just use that 
to create vertices in recoverTransition. Then synchronously call 
Vertex.handle(VertexRecoverEvent). Thoughts?

bq. No, we are sending list of vertexid. Before history event handling thread 
process these event, these vertexId has already been extrated.
Are we sure? From what I see, a transform is just a view. So it is evaluated 
afresh every item it is iterated and produces new objects. Could you please 
check in a debugger? If this is indeed the case, then we shoudl avoid the 
getVertex() lock contention. The same thing would hold inside 
VertexGroupCommitFinishedEvent.java and other places that are using transforms.

TaskImpl.java
logJobHistoryTaskStartedEvent() has the side effect of setting the scheduleTime 
(code moved from InitialScheduleTransition). IMO, these methods should be side 
effect free.

Unresolved TODO.
Should this event have the isRecovery flag set?
{code}
// TODO use tFinishedEvent.getTerminationCause after TaskTerminationCause to 
TaskFinishedEvent
task.eventHandler.handle(new TaskEventTermination(task.taskId,
    TaskAttemptTerminationCause.UNKNOWN_ERROR, 
tFinishedEvent.getDiagnostics()));{code}

bq. Now, I follow the semantics as before. Created a follow up jira to track 
this. TEZ-2926
>From the code it looks like if taskRecovery fails then we end up calling 
>addAndScheduleAttempt() instead of failing. Am I missing something?
{code}      if (!recoveredData) {
        task.successfulAttempt = null;
      } else {
        LOG.info("Recovered a successful attempt"
            + ", taskAttemptId=" + task.successfulAttempt.toString());
        return TaskStateInternal.SUCCEEDED;
      }

      if (task.failedAttempts >= task.maxFailedAttempts) {
        // Exceeded max attempts
        task.finished(TaskStateInternal.FAILED);
        return TaskStateInternal.FAILED;
      } else {
        task.addAndScheduleAttempt(getSchedulingCausalTA());
        return TaskStateInternal.RUNNING;
      }{code}

bq.  I plan to attach the VertexInitGeneratedEvent to VertexInitedEvent just 
like attach TaskAttemptGeneratedEvent to TaskAttemptFinished. May do it a 
follow up jira.
Sure. But today, logVertexInitGeneratedEvent will be logged multiple times, 
once for each input and we have multi-input vertices in Hive/Pig. Will that 
work today?

Setting initedTime can stay in initializeVertex and be set in recovery flow 
from recoveryData and normal flow from clock. Setting it in assignVertexManager 
seems misplaced.

logVertexReconfigureDoneEvent() should have a check similar to 
logVertexInitGeneratedEvent that prevents logging it into recovery log if 
recovery log already has one. Otherwise, NoOpVertexManagers activities may log 
a bogus reconfigureDoneEvent.


> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, 
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to