Hitesh Sharma created TEZ-3996:
----------------------------------
Summary: Reorder input failed events before data movement events
Key: TEZ-3996
URL: https://issues.apache.org/jira/browse/TEZ-3996
Project: Apache Tez
Issue Type: Improvement
Reporter: Hitesh Sharma
We have a custom processor (AbstractLogicalIOProcessor) that waits for
DataMovementEvents to arrive and then starts an external process to do some
work. When an input is revoked, the processor receives an InputFailedEvent
identifying the failed input, and we fail the processor because it is working
on stale inputs. Once the new inputs are available, Tez restarts the processor
and sends it the InputFailedEvent together with all of the DataMovementEvents,
both the older versions and the new version produced for the revoked input.
The issue we are seeing is that the events arrive out of order: we often
receive the older DataMovementEvent first, at which point our processor
decides it is safe to start. The InputFailedEvent and the new version of the
DataMovementEvent arrive afterwards, but by then it is too late and the
processor fails. This repeats on every subsequent task attempt, and the task
eventually fails.
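
To make the requested ordering concrete, here is a minimal sketch of the idea, not the Tez implementation. The classes DataMovement and InputFailed are hypothetical stand-ins that model only a target index and a version. The real DataMovementEvent and InputFailedEvent should be checked against the Tez API before relying on these names. The sketch assumes a batch of events is available as a list, moves the failure notifications to the front, and then drops any data movement event whose version was reported as failed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical stand-ins for Tez events, not the Tez classes. */
final class EventReorderSketch {

    /** Minimal view of an event: the input it targets and the producing attempt's version. */
    interface Event {
        int targetIndex();
        int version();
    }

    /** Stand-in for a data movement event (assumed fields only). */
    static final class DataMovement implements Event {
        private final int targetIndex;
        private final int version;
        DataMovement(int targetIndex, int version) {
            this.targetIndex = targetIndex;
            this.version = version;
        }
        public int targetIndex() { return targetIndex; }
        public int version() { return version; }
    }

    /** Stand-in for an input failed event (assumed fields only). */
    static final class InputFailed implements Event {
        private final int targetIndex;
        private final int version;
        InputFailed(int targetIndex, int version) {
            this.targetIndex = targetIndex;
            this.version = version;
        }
        public int targetIndex() { return targetIndex; }
        public int version() { return version; }
    }

    /** Stable reorder: every InputFailed first, relative order otherwise preserved. */
    static List<Event> failedFirst(List<Event> batch) {
        List<Event> ordered = new ArrayList<>(batch.size());
        for (Event e : batch) {
            if (e instanceof InputFailed) ordered.add(e);
        }
        for (Event e : batch) {
            if (!(e instanceof InputFailed)) ordered.add(e);
        }
        return ordered;
    }

    /** Data movement events whose (targetIndex, version) was not reported as failed. */
    static List<DataMovement> liveInputs(List<Event> batch) {
        Map<Integer, Set<Integer>> failedVersions = new HashMap<>();
        List<DataMovement> live = new ArrayList<>();
        // Failures are seen first, so stale data movement events can be dropped in one pass.
        for (Event e : failedFirst(batch)) {
            if (e instanceof InputFailed) {
                failedVersions
                    .computeIfAbsent(e.targetIndex(), k -> new HashSet<>())
                    .add(e.version());
            } else if (e instanceof DataMovement) {
                Set<Integer> failed = failedVersions.get(e.targetIndex());
                if (failed == null || !failed.contains(e.version())) {
                    live.add((DataMovement) e);
                }
            }
        }
        return live;
    }
}
```

With this ordering, a processor that starts work on the result of liveInputs would only ever see the replacement version of a revoked input, no matter where the failure notification sat in the original batch.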
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)