[
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632502#comment-16632502
]
Jason Lowe commented on TEZ-3996:
---------------------------------
bq. Arguably a better fix is to simply not send DMEs to tasks where we know the
input has failed rather than send it and then invalidate it.
What I meant here is to neither send the DME nor the input failed events. In
other words, if we know a DME is bad just don't tell the task at all about it
at all. It's just a waste, right? We can simply wait until a valid DME is
generated later.
bq. The happy path cases go like task receiving DMEs and then some IFEs
(typically after some time of receiving the initial DME). This is fine and the
processor can deal with it and in our case we mark the processor as a non-fatal
failure and retry.
Wait, if it can handle this then why does it matter if they arrive in the same
heartbeat? There's no guarantee that the events from a previous heartbeat have
been fully processed before the asynchronous task heartbeat retrieves more,
correct? The code should not be relying on whether these are arriving in a
single heartbeat vs. multiple heartbeats. We should address the core problem
of an IFE arriving too early relative to a DME -- that seems to be the crux of
the issue. I'm not seeing why the relative time between DME and IFE is
relevant to the nature of how the failure event is processed re: fatal vs.
non-fatal.
> Reorder input failed events before data movement events
> -------------------------------------------------------
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Hitesh Sharma
> Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for
> DataMovementEvent to arrive and then starts an external process to do some
> work. When a revocation happens then the processor recieves an
> InputFailedEvent, which tells it about the failed input, and we fail the
> processor as it is working on old inputs. When the new inputs are available
> then Tez restarts the processor and sends the InputFailedEvent along with all
> the DataMovementEvent which includes the older versions and the new version
> that was revocated.
> The issue we are seeing is that the events arrive out of order i.e. many
> times we see the older DataMovementEvent first at which our processor thinks
> it is good to start. We then receive the InputFailedEvent and the new version
> of DataMovementEvent, but that's late and the processor fails. This keeps
> repeating on every subsequent task attempt and the task fails.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)