[ 
https://issues.apache.org/jira/browse/TEZ-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197624#comment-17197624
 ] 

Rajesh Balamohan commented on TEZ-4233:
---------------------------------------

Thanks [~abstractdog] for the patch. I went through it. This can happen mainly 
in K8s, as the hostname is retained when a pod gets restarted. The patch looks 
good overall.

 
 # It would also be helpful to disable local fetch and check whether the 
problem goes away in K8s.
 # There is a potential issue in "ShuffleScheduler::isFetcherHealthy", as it 
could trigger frequent re-execution on the source side, depending on the 
failures (i.e., even a small number of initial failures could end up reaching 
the threshold). This can be fixed in a separate Jira if needed.
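
To illustrate the second point, here is a minimal sketch. It assumes the 
health check is a plain failure-ratio comparison; the class name, method 
names, parameters and threshold values are hypothetical and not taken from 
the actual Tez code. The second method is only one possible mitigation, not 
something the patch proposes.

```java
// Hypothetical sketch only; this is NOT the actual Tez ShuffleScheduler code.
public final class FetcherHealthSketch {

    private FetcherHealthSketch() {}

    /**
     * Assumed ratio-based check: the fetcher counts as healthy while
     * failures / (failures + completed) stays below maxFailedFraction.
     */
    public static boolean isHealthyByRatio(long failures, int completed,
                                           float maxFailedFraction) {
        if (completed <= 0) {
            // Assumption: no verdict before at least one input has completed.
            return true;
        }
        float failedFraction = (float) failures / (failures + completed);
        return failedFraction < maxFailedFraction;
    }

    /**
     * Possible mitigation (an assumption, not part of the patch): do not
     * judge health until a minimum number of fetch attempts has been seen.
     */
    public static boolean isHealthyWithMinSamples(long failures, int completed,
                                                  float maxFailedFraction,
                                                  int minSamples) {
        if (failures + completed < minSamples) {
            return true;
        }
        return isHealthyByRatio(failures, completed, maxFailedFraction);
    }

    public static void main(String[] args) {
        // One early failure against one completed input: 1 / (1 + 1) = 0.5,
        // which is not below a 0.5 threshold, so the check already fails.
        System.out.println(isHealthyByRatio(1, 1, 0.5f));
        // The same single failure late in the shuffle is negligible.
        System.out.println(isHealthyByRatio(1, 99, 0.5f));
        // With a minimum sample size the early failure is tolerated.
        System.out.println(isHealthyWithMinSamples(1, 1, 0.5f, 10));
    }
}
```

With a purely ratio-based check, the denominator is tiny at the start of the 
shuffle, so a single failure can dominate the fraction.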

 

> Map task should be blamed earlier for local fetch failures
> ----------------------------------------------------------
>
>                 Key: TEZ-4233
>                 URL: https://issues.apache.org/jira/browse/TEZ-4233
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4233.01.patch
>
>
> Fetch failures can be the result of a network issue or a disk issue. 
> Currently, the AM doesn't know whether the original input read error 
> happened because of a local fetch failure or not. I think if a map output 
> was reported as the subject of a local fetch failure, the AM should respond 
> earlier and blame it as soon as possible. The hidden assumption here is that 
> a disk read should never fail (or only rarely, compared to network issues).
> When I detected this issue, it was in a Kubernetes-based LLAP environment, 
> where a daemon completely disappeared and a new daemon - running reducer 
> tasks - assumed that it had the map outputs locally, which wasn't the case. 
> This patch can help in container mode as well, as we can assume that a local 
> read should work, and if it doesn't, the original map output data should be 
> re-generated as soon as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
