[jira] [Commented] (TEZ-4233) Map task should be blamed earlier for local fetch failures

Jira Wed, 23 Sep 2020 05:22:05 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200788#comment-17200788
 ]


László Bodor commented on TEZ-4233:
-----------------------------------

Thanks [~rajesh.balamohan], I addressed all your comments and found that the 
signature optimization you proposed in 1) can also be applied to 
ShuffleScheduler.copyFailed.
Additional changes:
1. refactored sendErrors
2. TestShuffleHandler now uses a secret manager to verify that the response 
headers are set properly (the initial commit did not validate this, and I 
ran into the problem on the cluster)

Other than that, most of the patch is test refactoring required by the 
change. The affected test classes can be run with:
{code}
mvn clean install -pl tez-plugins/tez-aux-services -pl tez-runtime-library \
  -pl tez-api -pl tez-dag \
  -Dtest=TestTaskAttempt,TestFetcher,TestShuffleScheduler,TestShuffleManager,TestShuffleHandler
{code}

> Map task should be blamed earlier for local fetch failures
> ----------------------------------------------------------
>
>                 Key: TEZ-4233
>                 URL: https://issues.apache.org/jira/browse/TEZ-4233
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4233.01.patch, TEZ-4233.02.patch, TEZ-4233.03.patch, 
> TEZ-4233.04.patch, TEZ-4233.05.patch
>
>
> Fetch failures can result from a network issue or a disk issue. Currently, the 
> AM doesn't know whether the original input read error was caused by a local 
> fetch failure. I think that if a map output is reported as the subject of a 
> local fetch failure, the AM should respond earlier and blame the map task as 
> soon as possible. The hidden assumption here is that a disk read should never 
> fail (or should fail only rarely compared to network issues).
> When I detected this issue, it was in a Kubernetes-based LLAP environment, where 
> a daemon completely disappeared and a new daemon, running reducer tasks, 
> assumed that it had the map outputs locally, which wasn't the case.
> This patch can help in container mode as well: we can assume that a local 
> read should work, and if it doesn't, the original map output data should be 
> regenerated as soon as possible.
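
To illustrate the idea in the description above, here is a minimal sketch of a blame policy that treats a local fetch failure as immediately fatal for the source map attempt, while remote (network) failures are tolerated up to a threshold. The class name, the method name, and the threshold parameter are all hypothetical and chosen for illustration only; this is not the code in the patch and not the Tez API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, NOT the Tez implementation: decides when the source
 * (map) attempt should be blamed for a reported fetch failure.
 */
final class FetchFailurePolicy {
    // Assumed tunable: how many remote failures are tolerated per source attempt.
    private final int remoteFailureThreshold;
    // Remote failure count per source attempt id.
    private final Map<String, Integer> remoteFailures = new HashMap<>();

    FetchFailurePolicy(int remoteFailureThreshold) {
        this.remoteFailureThreshold = remoteFailureThreshold;
    }

    /**
     * Records one fetch failure and returns true if the source attempt
     * should be blamed (and its output regenerated) right away.
     */
    boolean reportFailure(String srcAttemptId, boolean isLocalFetch) {
        if (isLocalFetch) {
            // A local disk read is assumed to (almost) never fail,
            // so a single local failure is enough to blame the source.
            return true;
        }
        // Remote failures may be transient network issues: count them
        // and blame the source only once the threshold is reached.
        int count = remoteFailures.merge(srcAttemptId, 1, Integer::sum);
        return count >= remoteFailureThreshold;
    }
}
```

With a threshold of 3, a remote failure for a given attempt only leads to blame on the third report, whereas the first local failure does so at once.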



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
