[
https://issues.apache.org/jira/browse/TEZ-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201142#comment-17201142
]
Rajesh Balamohan commented on TEZ-4233:
---------------------------------------
LGTM. +1.
> Map task should be blamed earlier for local fetch failures
> ----------------------------------------------------------
>
> Key: TEZ-4233
> URL: https://issues.apache.org/jira/browse/TEZ-4233
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: TEZ-4233.01.patch, TEZ-4233.02.patch, TEZ-4233.03.patch,
> TEZ-4233.04.patch, TEZ-4233.05.patch
>
>
> Fetch failures can result from network issues or disk issues. Currently, the AM
> doesn't know whether the original input read error happened because of
> a local fetch failure or not. I think if a map output was reported as the
> subject of a local fetch failure, the AM should respond earlier and blame it as
> soon as possible. There is a hidden assumption here that a disk read should never
> fail (or should fail relatively rarely compared to network issues).
> When I detected this issue, it was in a Kubernetes-based LLAP environment, where
> a daemon completely disappeared and a new daemon - running reducer tasks -
> assumed that it had the map outputs locally, which wasn't the case.
> This patch can help in container mode as well, as we can assume that a local
> read should work, and if it doesn't, the original map output data should be
> re-generated as soon as possible.
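
For illustration only, the policy described in the issue could be sketched roughly as follows. This is not the code from the attached patches: the class name, the method name, and both threshold values are hypothetical, and the sketch only assumes that each read-error report carries a flag saying whether the failed fetch was a local one.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds chosen for illustration; they are not Tez defaults.
REMOTE_FAILURE_THRESHOLD = 3  # tolerate a few network-related failures
LOCAL_FAILURE_THRESHOLD = 1   # a failed local disk read is blamed immediately


@dataclass
class FetchFailureTracker:
    """Counts read-error reports per map output (hypothetical helper)."""

    failures: dict = field(default_factory=dict)

    def report(self, map_output_id: str, is_local_fetch: bool) -> bool:
        """Record one failure report.

        Returns True when the map task that produced the output should be
        blamed, i.e. its output should be re-generated.
        """
        count = self.failures.get(map_output_id, 0) + 1
        self.failures[map_output_id] = count
        # Local fetch failures use the lower threshold, so they are blamed earlier.
        threshold = (
            LOCAL_FAILURE_THRESHOLD if is_local_fetch else REMOTE_FAILURE_THRESHOLD
        )
        return count >= threshold
```

In this sketch a single local fetch failure is enough to blame the map task, while remote failures have to accumulate first, which mirrors the assumption that a local read should practically never fail.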