[ 
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631261#comment-16631261
 ] 

Hitesh Sharma commented on TEZ-3996:
------------------------------------

A potential solution is to reorder the events in TaskReporter.

https://github.com/apache/tez/blob/64c04f1121ef1d04118e36b0e4fc3808205a50a8/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskReporter.java#L305

 
{code:java}
Collection<TezEvent> tezEvents = response.getEvents();

List<TezEvent> dataMovementEvents = new ArrayList<>();
List<TezEvent> inputFailedEvents = new ArrayList<>();
List<TezEvent> otherEvents = new ArrayList<>();
for (final TezEvent te : tezEvents) {
if (te.getEvent() instanceof DataMovementEvent) {
dataMovementEvents.add(te);
} else if (te.getEvent() instanceof InputFailedEvent) {
inputFailedEvents.add(te);
} else {
otherEvents.add(te);
}
}
// Reorder
tezEvents = new ArrayList<>();
tezEvents.addAll(inputFailedEvents);
tezEvents.addAll(dataMovementEvents);
tezEvents.addAll(otherEvents);
// Now raise events
task.handleEvents(tezEvents);
{code}
Thoughts on this approach?

> Reorder input failed events before data movement events
> -------------------------------------------------------
>
>                 Key: TEZ-3996
>                 URL: https://issues.apache.org/jira/browse/TEZ-3996
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Sharma
>            Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for 
> DataMovementEvent to arrive and then starts an external process to do some 
> work. When a revocation happens then the processor recieves an 
> InputFailedEvent, which tells it about the failed input, and we fail the 
> processor as it is working on old inputs. When the new inputs are available 
> then Tez restarts the processor and sends the InputFailedEvent along with all 
> the DataMovementEvent which includes the older versions and the new version 
> that was revocated.
> The issue we are seeing is that the events arrive out of order i.e. many 
> times we see the older DataMovementEvent first at which our processor thinks 
> it is good to start. We then receive the InputFailedEvent and the new version 
> of DataMovementEvent, but that's late and the processor fails. This keeps 
> repeating on every subsequent task attempt and the task fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to