[ 
https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631901#comment-16631901
 ] 

Jason Lowe commented on TEZ-3996:
---------------------------------

I believe every input failed event arrives _after_ the corresponding data 
movement event.  The input failed event is to notify the task that a prior DME 
is no longer valid.  Arguably a better fix is to simply not send DMEs to tasks 
where we know the input has failed rather than send it and then invalidate it.  
What worries me about sending them in reverse order is that a task may 
interpret the later DME as "oh, now the input is good and here's where 
to get it."
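
As a rough illustration of the "don't send the DME at all" idea, here is a 
minimal Java sketch. It is not Tez code: the Event class, the Kind enum and 
dropInvalidatedDmes are hypothetical stand-ins, and it assumes the batch is the 
complete event history being replayed to a fresh task attempt, with each input 
identified by a (target index, version) pair.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EventFilterSketch {
    // Hypothetical stand-ins for the routed events; NOT the Tez classes.
    public enum Kind { DATA_MOVEMENT, INPUT_FAILED }

    public static final class Event {
        public final Kind kind;
        public final int targetIndex;
        public final int version;

        public Event(Kind kind, int targetIndex, int version) {
            this.kind = kind;
            this.targetIndex = targetIndex;
            this.version = version;
        }

        String key() {
            return targetIndex + ":" + version;
        }
    }

    /**
     * Returns the batch without any DME that a failure event in the same
     * batch invalidates. The matching failure events are dropped as well,
     * on the assumption that the task never saw the DME they refer to.
     */
    public static List<Event> dropInvalidatedDmes(List<Event> batch) {
        // First pass: collect every (index, version) known to have failed.
        Set<String> failed = new HashSet<>();
        for (Event e : batch) {
            if (e.kind == Kind.INPUT_FAILED) {
                failed.add(e.key());
            }
        }
        // Second pass: keep only events that do not refer to a failed input.
        List<Event> out = new ArrayList<>();
        for (Event e : batch) {
            if (!failed.contains(e.key())) {
                out.add(e);
            }
        }
        return out;
    }
}
```

With this kind of filtering the task only ever sees DMEs for inputs that are 
still believed valid, so the relative order of DMEs and failure events stops 
mattering for a restarted attempt.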

InputFailedEvent has traditionally been used to indicate inputs that are likely 
not going to be fetchable from a task, and a task is free to ignore the input 
failure if it was able to successfully fetch the input that supposedly has 
failed.  It sounds like input failure is being redefined a bit in this context 
where somehow the input is retrievable but considered invalid?
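
To make the two interpretations concrete, here is a small Java sketch of 
task-side bookkeeping under the traditional semantics. It is only an 
illustration: InputTrackerSketch and its methods are hypothetical, not part of 
Tez, and it assumes a newer version of an input always carries a larger 
version number than the one it replaces.

```java
import java.util.HashMap;
import java.util.Map;

public class InputTrackerSketch {
    // Highest version reported as failed, per input index.
    private final Map<Integer, Integer> highestFailedVersion = new HashMap<>();
    // Version that was successfully fetched, per input index.
    private final Map<Integer, Integer> fetchedVersion = new HashMap<>();

    /** Returns true if a DME for this input and version is worth acting on. */
    public boolean onDataMovement(int inputIndex, int version) {
        Integer failed = highestFailedVersion.get(inputIndex);
        // Ignore DMEs at or below a version already known to have failed.
        return failed == null || version > failed;
    }

    /** Records that the given version of the input was fetched successfully. */
    public void onFetchSucceeded(int inputIndex, int version) {
        fetchedVersion.put(inputIndex, version);
    }

    /**
     * Returns true if the failure needs action (wait for a new DME).
     * Under the traditional semantics a failure can be ignored when the
     * task already holds that version of the input, or a newer one.
     */
    public boolean onInputFailed(int inputIndex, int version) {
        highestFailedVersion.merge(inputIndex, version, Math::max);
        Integer fetched = fetchedVersion.get(inputIndex);
        return fetched == null || fetched < version;
    }
}
```

A processor that must treat a fetched-but-revoked input as invalid could not 
use the "ignore if already fetched" branch in onInputFailed, which is the 
redefinition being asked about.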

> Reorder input failed events before data movement events
> -------------------------------------------------------
>
>                 Key: TEZ-3996
>                 URL: https://issues.apache.org/jira/browse/TEZ-3996
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Sharma
>            Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for 
> DataMovementEvent to arrive and then starts an external process to do some 
> work. When a revocation happens then the processor receives an 
> InputFailedEvent, which tells it about the failed input, and we fail the 
> processor as it is working on old inputs. When the new inputs are available 
> then Tez restarts the processor and sends the InputFailedEvent along with all 
> the DataMovementEvents, which include the older versions and the new version 
> of the input that was revoked.
> The issue we are seeing is that the events arrive out of order i.e. many 
> times we see the older DataMovementEvent first at which our processor thinks 
> it is good to start. We then receive the InputFailedEvent and the new version 
> of DataMovementEvent, but that's late and the processor fails. This keeps 
> repeating on every subsequent task attempt and the task fails.


