[
https://issues.apache.org/jira/browse/TEZ-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918730#comment-13918730
]
Siddharth Seth commented on TEZ-902:
------------------------------------
bq. 1) A failed input is not obsoleted and thus the fetcher can get hung
retrying the same input
This should already be handled. When Inputs are assigned to a fetcher, they are
all removed from the Host. In case of a failure, the Input which failed is not
put back on this list (only the subsequent queued Inputs are). Putting these
Outputs on the obsolete list should not be required.
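To make the behaviour described above concrete, here is a minimal sketch of a
per-host pending list. The class and method names (HostQueueSketch,
assignToFetcher, onFetchFailure) are hypothetical and are not the actual Tez
classes; Inputs are modelled as plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a host's pending Input list; not the real Tez code. */
public class HostQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();

    public void addInput(String inputId) {
        pending.addLast(inputId);
    }

    /** Hands every pending Input to a fetcher and empties the host's list. */
    public List<String> assignToFetcher() {
        List<String> assigned = new ArrayList<>(pending);
        pending.clear();
        return assigned;
    }

    /**
     * After a failure at position failedIndex, only the Inputs queued behind
     * the failed one go back on the list. The failed Input itself does not.
     */
    public void onFetchFailure(List<String> assigned, int failedIndex) {
        for (int i = failedIndex + 1; i < assigned.size(); i++) {
            pending.addLast(assigned.get(i));
        }
    }

    public List<String> pendingInputs() {
        return new ArrayList<>(pending);
    }
}
```

With Inputs a, b and c assigned and a failure on b, only c is re-queued, so
the same fetcher cannot pick b up again from this list.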
bq. 2) If there are multiple versions of an input (say for some reason the
first version was killed and then regenerated) then the Fetcher tries to
download all versions instead of the last version.
Shouldn't the ShuffleScheduler be getting an InputFailedEvent in this case? It
already processes that event to populate the obsoleteSet, which in turn is used
when getting the list of Inputs which need to be fetched from a host (ignoring
the ones which are considered to be failed). Is the InputFailedEvent not
showing up?
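A rough sketch of the filtering described above, assuming a simplified
scheduler that tracks attempts as strings. ObsoleteFilterSketch, onInputFailed
and inputsToFetch are made-up names for illustration; only the idea of an
obsolete set populated from failure events comes from the comment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the scheduler's obsolete-set handling. */
public class ObsoleteFilterSketch {
    private final Set<String> obsoleteSet = new HashSet<>();

    /** Called when a failure event arrives for an input attempt. */
    public void onInputFailed(String attemptId) {
        obsoleteSet.add(attemptId);
    }

    /** Returns the known attempts for a host, skipping any marked obsolete. */
    public List<String> inputsToFetch(List<String> knownForHost) {
        List<String> result = new ArrayList<>();
        for (String attemptId : knownForHost) {
            if (!obsoleteSet.contains(attemptId)) {
                result.add(attemptId);
            }
        }
        return result;
    }
}
```

If the failure event for the first version arrives before the fetch list is
built, only the regenerated version is returned. If it never arrives, both
versions are returned, which would match the reported behaviour.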
Also, I don't think we should be making the InputAttemptIdentifier rename
changes in this patch (or big refactors as parts of patches in general), since
that makes the patch tough to review. I think 80% of the patch is refactor. To
me it looks like handling InputFailures should already be working - I am not
sure what exactly is changing to make this work better.
I think a couple of changes are required here -
1) If the AM knows that an Input has been obsoleted by a subsequent Input - this
needs to be handled within the AM itself. The way tasks are sent events,
there's no guarantee that the original event and the failed event will be seen
in the same heartbeat - which means the Fetcher may already have kicked in for
a known bad Input.
2) If a Fetcher is already running for a host - inform it about newly received
InputFailedEvents - so that it can skip the relevant Input. This may actually
not be required / work very well - since a request is made for all inputs at
the same time.
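One possible shape for change 1), as a sketch only: before the AM sends queued
events to a task, it cancels out any "available" event whose attempt has
already failed, provided that event has not been sent yet. AmEventFilterSketch,
Event and eventsToSend are hypothetical names, and the real AM event routing
is not modelled here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical AM-side filter; not the actual Tez event routing code. */
public class AmEventFilterSketch {

    /** Either announces an input attempt (failed == false) or marks it failed. */
    public static final class Event {
        public final String attemptId;
        public final boolean failed;

        public Event(String attemptId, boolean failed) {
            this.attemptId = attemptId;
            this.failed = failed;
        }
    }

    /**
     * Drops an "available" event and its matching failure event when both are
     * still queued, so the task never hears about the bad attempt. A failure
     * event with no queued "available" event is kept, because the original
     * event may already have been sent in an earlier heartbeat.
     */
    public static List<Event> eventsToSend(List<Event> queued) {
        Set<String> failedAttempts = new HashSet<>();
        Set<String> unsentAvailable = new HashSet<>();
        for (Event e : queued) {
            if (e.failed) {
                failedAttempts.add(e.attemptId);
            } else {
                unsentAvailable.add(e.attemptId);
            }
        }
        List<Event> out = new ArrayList<>();
        for (Event e : queued) {
            boolean cancelled = failedAttempts.contains(e.attemptId)
                    && unsentAvailable.contains(e.attemptId);
            if (!cancelled) {
                out.add(e);
            }
        }
        return out;
    }
}
```

This only covers the case where both events are still queued in the AM. Once
the original event has gone out, the task-side obsolete handling still has to
deal with the failure event.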
> Fetch failure issues in shuffle Input
> -------------------------------------
>
> Key: TEZ-902
> URL: https://issues.apache.org/jira/browse/TEZ-902
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-902.1.patch
>
>