[ 
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632689#comment-16632689
 ] 

Hitesh Sharma commented on TEZ-3996:
------------------------------------

{quote}

Wait, if it can handle this then why does it matter if they arrive in the same 
heartbeat? There's no guarantee that the events from a previous heartbeat have 
been fully processed before the asynchronous task heartbeat retrieves more, 
correct? The code should not be relying on whether these are arriving in a 
single heartbeat vs. multiple heartbeats. 

{quote}

 

What I meant was that say the DMEs come at time t0 and some failure happens at 
t0 + x, then at some time after t0 + x, the processor would learn about the 
failure and fail the task (we launch an external process in the processor and 
for various reasons can't restart the external process with the new version of 
DMEs).

Upon the subsequent retry of the attempt at time say, t1 (where t1 > t0 + x), 
we know about all the IFEs and DMEs so having some ordering between them would 
allow the processor to wait for the correct DMEs. 

It's a good point that even at time t1 the AM may send the events across 
multiple heartbeats and maybe the ordering needs to happen on the AM side. In 
that sense dropping the older DMEs before sending the events to the task at 
time t1 would work and even sending the higher version DMEs ahead of older 
versioned DMEs would also be fine.

> Reorder input failed events before data movement events
> -------------------------------------------------------
>
>                 Key: TEZ-3996
>                 URL: https://issues.apache.org/jira/browse/TEZ-3996
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Sharma
>            Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for 
> DataMovementEvent to arrive and then starts an external process to do some 
> work. When a revocation happens then the processor recieves an 
> InputFailedEvent, which tells it about the failed input, and we fail the 
> processor as it is working on old inputs. When the new inputs are available 
> then Tez restarts the processor and sends the InputFailedEvent along with all 
> the DataMovementEvent which includes the older versions and the new version 
> that was revocated.
> The issue we are seeing is that the events arrive out of order i.e. many 
> times we see the older DataMovementEvent first at which our processor thinks 
> it is good to start. We then receive the InputFailedEvent and the new version 
> of DataMovementEvent, but that's late and the processor fails. This keeps 
> repeating on every subsequent task attempt and the task fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to