[jira] [Commented] (TEZ-4233) Map task should be blamed earlier for local fetch failures

Rajesh Balamohan (Jira) Wed, 23 Sep 2020 15:30:29 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201142#comment-17201142
 ]


Rajesh Balamohan commented on TEZ-4233:
---------------------------------------

LGTM. +1.

> Map task should be blamed earlier for local fetch failures
> ----------------------------------------------------------
>
>                 Key: TEZ-4233
>                 URL: https://issues.apache.org/jira/browse/TEZ-4233
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4233.01.patch, TEZ-4233.02.patch, TEZ-4233.03.patch, 
> TEZ-4233.04.patch, TEZ-4233.05.patch
>
>
> Fetch failures can result from either a network issue or a disk issue. 
> Currently, the AM does not know whether the original input read error was 
> caused by a local fetch failure. I think that if a map output is reported as 
> the subject of a local fetch failure, the AM should respond earlier and blame 
> it as soon as possible. The hidden assumption here is that a disk read should 
> never fail (or should fail relatively rarely compared to network issues).
> I detected this issue in a Kubernetes-based LLAP environment, where a daemon 
> completely disappeared and a new daemon, running reducer tasks, assumed that 
> it had the map outputs locally, which wasn't the case. 
> This patch can help in container mode as well, since we can assume that a 
> local read should work, and if it doesn't, the original map output data 
> should be regenerated as soon as possible.
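
The policy described in the issue can be illustrated with a minimal sketch. This is not the code from the attached patches: the class name, method name, and threshold parameter below are all hypothetical, chosen only to show the idea that a local fetch failure is blamed on the map output immediately, while remote (network) failures are tolerated up to an assumed threshold.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration of the idea in TEZ-4233; not the Tez implementation.
 * All names here are assumptions made for this sketch.
 */
class LocalFetchBlameSketch {
    // Assumed setting: how many remote fetch failure reports are tolerated
    // for one map attempt before its output is considered lost.
    private final int remoteFailureThreshold;

    // Remote fetch failure reports counted per map attempt id.
    private final Map<String, Integer> remoteFailures = new HashMap<>();

    LocalFetchBlameSketch(int remoteFailureThreshold) {
        this.remoteFailureThreshold = remoteFailureThreshold;
    }

    /**
     * Returns true if the map attempt should be blamed, meaning its output
     * should be regenerated.
     */
    boolean shouldBlame(String mapAttemptId, boolean isLocalFetch) {
        if (isLocalFetch) {
            // A local disk read is assumed to fail only rarely, so a single
            // local fetch failure is enough to blame the map output.
            return true;
        }
        // Network failures may be transient: count them and blame the map
        // attempt only once the threshold is reached.
        int count = remoteFailures.merge(mapAttemptId, 1, Integer::sum);
        return count >= remoteFailureThreshold;
    }
}
```

With an assumed threshold of 3, a single report with isLocalFetch set to true blames the map attempt at once, whereas remote failures for the same attempt lead to blame only on the third report.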



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
