[ 
https://issues.apache.org/jira/browse/TEZ-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918730#comment-13918730
 ] 

Siddharth Seth commented on TEZ-902:
------------------------------------

bq. 1) A failed input is not obsoleted and thus the fetcher can get hung 
retrying the same input
This should already be handled. When Inputs are assigned to a fetcher, they are 
all removed from the Host. In case of a failure, the Input which failed is not 
put back on this list (only the subsequent queued Inputs are). Putting these 
Outputs on the obsolete list should not be required.
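
To make the behaviour described above concrete, here is a minimal sketch. It assumes a simplified per-host queue; HostQueueSketch, assignToFetcher and onFetchFailure are hypothetical names for illustration only, not the actual Tez classes or methods, and inputs are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-host pending list; not the real Tez class.
final class HostQueueSketch {
    private final List<String> pending = new ArrayList<>();

    void add(String inputAttempt) {
        pending.add(inputAttempt);
    }

    // Assigning to a fetcher removes every pending input from the host.
    List<String> assignToFetcher() {
        List<String> assigned = new ArrayList<>(pending);
        pending.clear();
        return assigned;
    }

    // On a failure, only the inputs queued after the failed one are put back.
    // The failed input itself is not re-queued, so it cannot be retried forever.
    void onFetchFailure(List<String> assigned, int failedIndex) {
        pending.addAll(assigned.subList(failedIndex + 1, assigned.size()));
    }

    // Defensive copy so callers cannot mutate the internal list.
    List<String> pending() {
        return new ArrayList<>(pending);
    }
}
```

With three inputs assigned and a failure on the second, only the third ends up back on the host's list.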

bq. 2) If there are multiple versions of an input (say for some reason the 
first version was killed and then regenerated) then the Fetcher tries to 
download all versions instead of the last version.
Shouldn't the ShuffleScheduler be getting an InputFailedEvent in this case? It 
already processes that event to populate the obsoleteSet, which in turn is used 
when getting the list of Inputs which need to be fetched from a host (ignoring 
the ones which are considered to be failed). Is the InputFailedEvent not 
showing up?
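
A rough sketch of the filtering I have in mind, assuming input attempts are modelled as plain strings. ObsoleteFilterSketch, onInputFailed and fetchable are hypothetical names for illustration, not the actual ShuffleScheduler API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of an obsolete set; not the real ShuffleScheduler.
final class ObsoleteFilterSketch {
    private final Set<String> obsoleteSet = new HashSet<>();

    // Called when a failure event arrives for a given input attempt.
    void onInputFailed(String inputAttempt) {
        obsoleteSet.add(inputAttempt);
    }

    // Returns the attempts that still need fetching, skipping obsolete ones.
    List<String> fetchable(List<String> knownAttempts) {
        List<String> result = new ArrayList<>();
        for (String attempt : knownAttempts) {
            if (!obsoleteSet.contains(attempt)) {
                result.add(attempt);
            }
        }
        return result;
    }
}
```

If attempt 0 of an input has been reported as failed and attempt 1 has been announced, only attempt 1 is returned for fetching, provided the failure event has actually arrived.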

Also, I don't think we should be making the InputAttemptIdentifier rename 
changes in this patch (or big refactors as parts of patches in general), since 
that makes the patch tough to review. I think 80% of the patch is refactor. To 
me it looks like handling InputFailures should already be working - I am not 
sure what exactly is changing to make this work better.

I think a couple of changes are required here - 
1) If the AM knows that an Input has been obsoleted by a subsequent Input - this 
needs to be handled within the AM itself. The way tasks are sent events, 
there's no guarantee that the original event and the failed event will be seen 
in the same heartbeat - which means the Fetcher may already have kicked in for 
a known bad Input.
2) If a Fetcher is already running for a host - inform it about newly received 
InputFailedEvents - so that it can skip the relevant Input. This may not 
actually be required, or work very well, since a request is made for all inputs 
at the same time.


> Fetch failure issues in shuffle Input
> -------------------------------------
>
>                 Key: TEZ-902
>                 URL: https://issues.apache.org/jira/browse/TEZ-902
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-902.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
