[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992346#comment-14992346
]
Bikas Saha commented on TEZ-2581:
---------------------------------
DAGImpl.recoverTransition
Can we add some diagnostics saying DAG recovery failed. AND/OR DAG recovery
succeeded but status information may be partial.
TaskAttemptFinishedEvent.java/VertexInitGeneratedEvent.java
We could do builder.addTaGeneratedEvents() directly in the loop instead of
creating another list.
TezExampleBase.java
Methods made public for testing should have an @Private and @VisibleForTesting
annotation
DAGImpl.java
bq. Only the dag init event is synchronously handled (just for generating the
vertices). VertexEventRecover is handled asynchronously
I am wary that this may break after a future change made to the normal INIT
transition. Init transition has other logic for the normal code path whereas in
recoverTransition we only want to set the state to a known state for clients.
IMO, we could put the vertex creation code in a common method and just use that
to create vertices in recoverTransition. Then synchronously call
Vertex.handle(VertexRecoverEvent). Thoughts?
bq. No, we are sending list of vertexid. Before history event handling thread
process these event, these vertexId has already been extrated.
Are we sure? From what I see, a transform is just a view. So it is evaluated
afresh every item it is iterated and produces new objects. Could you please
check in a debugger? If this is indeed the case, then we shoudl avoid the
getVertex() lock contention. The same thing would hold inside
VertexGroupCommitFinishedEvent.java and other places that are using transforms.
TaskImpl.java
logJobHistoryTaskStartedEvent() has the side effect of setting the scheduleTime
(code moved from InitialScheduleTransition). IMO, these methods should be side
effect free.
Unresolved TODO.
Should this event have the isRecovery flag set?
{code}
// TODO use tFinishedEvent.getTerminationCause after TaskTerminationCause to
TaskFinishedEvent
task.eventHandler.handle(new TaskEventTermination(task.taskId,
TaskAttemptTerminationCause.UNKNOWN_ERROR,
tFinishedEvent.getDiagnostics()));{code}
bq. Now, I follow the semantics as before. Created a follow up jira to track
this. TEZ-2926
>From the code it looks like if taskRecovery fails then we end up calling
>addAndScheduleAttempt() instead of failing. Am I missing something?
{code} if (!recoveredData) {
task.successfulAttempt = null;
} else {
LOG.info("Recovered a successful attempt"
+ ", taskAttemptId=" + task.successfulAttempt.toString());
return TaskStateInternal.SUCCEEDED;
}
if (task.failedAttempts >= task.maxFailedAttempts) {
// Exceeded max attempts
task.finished(TaskStateInternal.FAILED);
return TaskStateInternal.FAILED;
} else {
task.addAndScheduleAttempt(getSchedulingCausalTA());
return TaskStateInternal.RUNNING;
}{code}
bq. I plan to attach the VertexInitGeneratedEvent to VertexInitedEvent just
like attach TaskAttemptGeneratedEvent to TaskAttemptFinished. May do it a
follow up jira.
Sure. But today, logVertexInitGeneratedEvent will be logged multiple times,
once for each input and we have multi-input vertices in Hive/Pig. Will that
work today?
Setting initedTime can stay in initializeVertex and be set in recovery flow
from recoveryData and normal flow from clock. Setting it in assignVertexManager
seems misplaced.
logVertexReconfigureDoneEvent() should have a check similar to
logVertexInitGeneratedEvent that prevents logging it into recovery log if
recovery log already has one. Otherwise, NoOpVertexManagers activities may log
a bogus reconfigureDoneEvent.
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch,
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch,
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch,
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf,
> TezRecoveryRedesignV1.1.pdf
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)