[
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634175#comment-16634175
]
Jason Lowe commented on TEZ-3996:
---------------------------------
bq. we launch an external process in the processor and for various reasons
can't restart the external process with the new version of DMEs
This limitation sounds like it lowers the fault tolerance in the DAG. If I
understand correctly, any retroactive failure of an upstream task forces any
active downstream task to fail because we cannot update the downstream task
with a new DME when the upstream task rerun completes. That means it could
only take four upstream task attempt reruns, across four different upstream
tasks, to fail a downstream vertex if the upstream re-runs were spread out
sufficiently in time and a downstream attempt was relaunched in-between each
upstream retroactive failure. So instead of any one upstream task failing four
times to fail the DAG, it becomes any four attempts _across_ the upstream tasks
worst-case.
Dropping the DME event seems like the right approach, although I worry a bit
that this may be an expensive thing to do on the AM side with a large number of
upstream and downstream tasks. We may need to refactor how those are tracked
AM-side. Another approach which isn't as clean but might scale better is to
have the AM send over an event when the task attempt is "up to date" with
events -- in other words, the pending event queue is drained on the AM side and
it could be a while before more events are sent to the task attempt. Then a
downstream task can load up all the events, filtering DMEs that have been
invalidated by later IFEs, until it receives the special, "up to date" event
which indicates it's OK to start the processing of any valid DMEs received so
far.
> Reorder input failed events before data movement events
> -------------------------------------------------------
>
> Key: TEZ-3996
> URL: https://issues.apache.org/jira/browse/TEZ-3996
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Hitesh Sharma
> Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for
> DataMovementEvent to arrive and then starts an external process to do some
> work. When a revocation happens then the processor recieves an
> InputFailedEvent, which tells it about the failed input, and we fail the
> processor as it is working on old inputs. When the new inputs are available
> then Tez restarts the processor and sends the InputFailedEvent along with all
> the DataMovementEvent which includes the older versions and the new version
> that was revocated.
> The issue we are seeing is that the events arrive out of order i.e. many
> times we see the older DataMovementEvent first at which our processor thinks
> it is good to start. We then receive the InputFailedEvent and the new version
> of DataMovementEvent, but that's late and the processor fails. This keeps
> repeating on every subsequent task attempt and the task fails.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)