[
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631261#comment-16631261
]
Hitesh Sharma commented on TEZ-3996:
------------------------------------
A potential solution is to reorder the events in TaskReporter.
https://github.com/apache/tez/blob/64c04f1121ef1d04118e36b0e4fc3808205a50a8/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskReporter.java#L305
{code:java}
Collection<TezEvent> tezEvents = response.getEvents();
List<TezEvent> dataMovementEvents = new ArrayList<>();
List<TezEvent> inputFailedEvents = new ArrayList<>();
List<TezEvent> otherEvents = new ArrayList<>();
for (final TezEvent te : tezEvents) {
if (te.getEvent() instanceof DataMovementEvent) {
dataMovementEvents.add(te);
} else if (te.getEvent() instanceof InputFailedEvent) {
inputFailedEvents.add(te);
} else {
otherEvents.add(te);
}
}
// Reorder
tezEvents = new ArrayList<>();
tezEvents.addAll(inputFailedEvents);
tezEvents.addAll(dataMovementEvents);
tezEvents.addAll(otherEvents);
// Now raise events
task.handleEvents(tezEvents);
{code}
Thoughts on this approach?
> Reorder input failed events before data movement events
> -------------------------------------------------------
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Hitesh Sharma
> Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for
> DataMovementEvent to arrive and then starts an external process to do some
> work. When a revocation happens then the processor recieves an
> InputFailedEvent, which tells it about the failed input, and we fail the
> processor as it is working on old inputs. When the new inputs are available
> then Tez restarts the processor and sends the InputFailedEvent along with all
> the DataMovementEvent which includes the older versions and the new version
> that was revocated.
> The issue we are seeing is that the events arrive out of order i.e. many
> times we see the older DataMovementEvent first at which our processor thinks
> it is good to start. We then receive the InputFailedEvent and the new version
> of DataMovementEvent, but that's late and the processor fails. This keeps
> repeating on every subsequent task attempt and the task fails.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)