[ 
https://issues.apache.org/jira/browse/TEZ-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200433#comment-17200433
 ] 

Rajesh Balamohan commented on TEZ-4233:
---------------------------------------

Thanks [~abstractdog] for the revised patch.

1. org.a.t.r.l.common.shuffle.Fetcher: Reporting the error back via 
"isLocalFetch" can be simplified, since HostFetchResult already carries an 
"InputAttemptFetchFailure", which records whether the failure was a local 
fetch failure or not. So "setupLocalDiskFetch()" could just set those details 
there, which would simplify the changes in "FetcherCallback" (i.e. avoid 
isLocalFetch).

2. In ShuffleHandler, the "verifyRequest" call in the exception-handling code 
path can be avoided. Is it possible to reuse "sendError()" with the 
"ShuffleHandlerError.DISK_ERROR_EXCEPTION" message packed in?

3. The terminology around "isLocalFetch" vs. "isLocalFetchFailure" in the 
"DISK_ERROR_EXCEPTION" case is confusing. If ShuffleHandler reports 
"DISK_ERROR_EXCEPTION", you are basically letting the source task restart. 
It would be good to make this explicit in "InputReadErrorEvent" instead of 
clubbing it with isLocalFetch. "TaskAttemptImpl" can be modified accordingly 
(something like "!readErrorEvent.isLocalFetch() && 
!readErrorEvent.isDiskErrorAtSource").
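
To illustrate point 3, here is a rough, self-contained sketch of how the two 
flags could be kept separate. It is not code from the patch: the class 
"ReadErrorSketch", the stand-in type "ReadError", the method "canTolerate" and 
the "withinFailureLimits" parameter are all hypothetical names, and only the 
combined condition is taken from the suggestion above.

```java
// Hypothetical sketch; not the actual Tez classes.
final class ReadErrorSketch {

    /** Stand-in for the two flags that InputReadErrorEvent would expose. */
    static final class ReadError {
        private final boolean localFetch;
        private final boolean diskErrorAtSource;

        ReadError(boolean localFetch, boolean diskErrorAtSource) {
            this.localFetch = localFetch;
            this.diskErrorAtSource = diskErrorAtSource;
        }

        /** True when the consumer failed to read the output from local disk. */
        boolean isLocalFetch() {
            return localFetch;
        }

        /** True when the remote side reported a disk error for the output. */
        boolean isDiskErrorAtSource() {
            return diskErrorAtSource;
        }
    }

    /**
     * Returns true when the source attempt may keep running despite the
     * reported read error. Either flag forces an early restart of the source,
     * regardless of the usual failure limits (assumed to be computed elsewhere
     * and passed in as withinFailureLimits).
     */
    static boolean canTolerate(ReadError event, boolean withinFailureLimits) {
        return withinFailureLimits
            && !event.isLocalFetch()
            && !event.isDiskErrorAtSource();
    }
}
```

With the flags separated like this, a "DISK_ERROR_EXCEPTION" from 
ShuffleHandler would set only the disk-error flag, so isLocalFetch keeps its 
literal meaning.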

 

> Map task should be blamed earlier for local fetch failures
> ----------------------------------------------------------
>
>                 Key: TEZ-4233
>                 URL: https://issues.apache.org/jira/browse/TEZ-4233
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4233.01.patch, TEZ-4233.02.patch, TEZ-4233.03.patch
>
>
> Fetch failures can be the result of a network issue or a disk issue. 
> Currently, the AM doesn't know whether the original input read error 
> happened because of a local fetch failure or not. I think if a map output 
> was reported as the subject of a local fetch failure, the AM should respond 
> earlier and blame it as soon as possible. The hidden assumption here is that 
> a disk read should never fail (or only rarely, compared to network issues).
> When I detected this issue, it was in a Kubernetes-based LLAP environment, 
> where a daemon completely disappeared and a new daemon - running reducer 
> tasks - assumed that it had the map outputs locally, which wasn't the case. 
> This patch can help in container mode as well, as we can assume that a local 
> read should work, and if it doesn't, the original map output data should be 
> re-generated as soon as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
