[
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634563#comment-16634563
]
Hitesh Sharma commented on TEZ-3996:
------------------------------------
Thanks for the feedback, Jason. We currently don't support relaunching the
external process within the processor, but in practice it isn't bad because we
simply fetch a new container or reuse an existing one to rerun the task
with the new DMEs. In the near future we will optimize this further by at least
checking whether the external process manages to succeed with the older DME,
in which case we need not proactively terminate it. Beyond this we can't do
much at this point, as it would break the contract we have with the external
process, which happens to be the Scope runtime engine.
Let me look at the code a bit more in light of your suggestions and get back
to you.
Thanks!
> Reorder input failed events before data movement events
> -------------------------------------------------------
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Hitesh Sharma
> Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for
> DataMovementEvent to arrive and then starts an external process to do some
> work. When a revocation happens, the processor receives an
> InputFailedEvent, which tells it about the failed input, and we fail the
> processor as it is working on old inputs. When the new inputs are available,
> Tez restarts the processor and sends the InputFailedEvent along with all
> the DataMovementEvents, which include the older versions and the new version
> of the input that was revoked.
> The issue we are seeing is that the events arrive out of order, i.e., we
> often see the older DataMovementEvent first, at which point our processor
> thinks it is good to start. We then receive the InputFailedEvent and the new
> version of the DataMovementEvent, but by then it is too late and the
> processor fails. This keeps repeating on every subsequent task attempt and
> the task fails.
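For illustration only, here is a minimal sketch of the reordering the issue
title asks for: a stable partition that moves every InputFailedEvent ahead of
the DataMovementEvents while keeping the relative order within each group. The
Ev class and the failedFirst method are hypothetical stand-ins written for
this example; they are not the real Tez event classes and not necessarily how
Tez would implement the change.

```java
import java.util.ArrayList;
import java.util.List;

public class EventReorderSketch {
    // Hypothetical stand-in for a Tez event (assumption, not the Tez API).
    static final class Ev {
        final boolean inputFailed; // true: InputFailedEvent, false: DataMovementEvent
        final int targetIndex;     // which input the event refers to
        final int version;         // attempt/version of that input

        Ev(boolean inputFailed, int targetIndex, int version) {
            this.inputFailed = inputFailed;
            this.targetIndex = targetIndex;
            this.version = version;
        }

        @Override
        public String toString() {
            return (inputFailed ? "InputFailed" : "DataMovement")
                + "(target=" + targetIndex + ", version=" + version + ")";
        }
    }

    // Stable partition: all failure events first, then all data movement
    // events, each group in its original arrival order.
    static List<Ev> failedFirst(List<Ev> events) {
        List<Ev> out = new ArrayList<>(events.size());
        for (Ev e : events) {
            if (e.inputFailed) {
                out.add(e);
            }
        }
        for (Ev e : events) {
            if (!e.inputFailed) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Ev> arrived = new ArrayList<>();
        arrived.add(new Ev(false, 0, 0)); // stale DataMovementEvent seen first
        arrived.add(new Ev(true, 0, 0));  // failure notice for that input
        arrived.add(new Ev(false, 0, 1)); // replacement DataMovementEvent
        for (Ev e : failedFirst(arrived)) {
            System.out.println(e);
        }
    }
}
```

With this ordering the processor would learn that version 0 of input 0 has
failed before it sees the stale DataMovementEvent, so it would not start the
external process on old data.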