[ 
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2882:
----------------------------------
    Attachment: TEZ-2882.1.patch

Changes:
- Reduced abortFailureLimit from 30 to 15. This should be fine for detecting 
read/connect issues, but it could be aggressive when the server returns a 500 
internal server error (e.g. the server had disk issues: it was able to read 
the index file properly, but hit disk errors while streaming the actual 
contents and ended up throwing a 500 internal server error). In such cases, 
reducing this value from 30 to 15 might cause failures to be declared a little 
more aggressively. This should be OK, as in the case of a 500 internal server 
error there is hardly a chance for the server to serve healthy output.
- Added the ability to track failure rates since the last progress. Task 
health is checked based on this, which should improve the accuracy of deciding 
whether the consumer or the source has to be restarted. Also, the consumer is 
restarted only when errors have happened across 20% of the hosts (e.g. failing 
to fetch from 1 host while succeeding from the others is likely that 1 host's 
problem; failing to fetch from a large number of hosts is likely caused by 
the consumer).
- Added a set of tests for this. Added a simple test for checking the penalty 
as well.
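
To illustrate the host-fraction check described above, here is a minimal 
sketch. It is not the code from the patch: the class name, method names, the 
Verdict enum, the use of ">=" for the 20% threshold, and the choice of "total 
known source hosts" as the denominator are all assumptions made for 
illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; names and threshold semantics are assumptions, not Tez APIs. */
class FetchHealthSketch {
    enum Verdict { HEALTHY, SOURCE_SUSPECT, CONSUMER_SUSPECT }

    // "20% of the hosts" from the change description, kept as an integer percent.
    static final int HOST_FAILURE_PERCENT = 20;

    private final int totalHosts;
    // Distinct hosts that failed since the last successful fetch (last progress).
    private final Set<String> failedHostsSinceProgress = new HashSet<>();

    FetchHealthSketch(int totalHosts) {
        if (totalHosts <= 0) {
            throw new IllegalArgumentException("totalHosts must be positive");
        }
        this.totalHosts = totalHosts;
    }

    /** Any successful fetch counts as progress and resets the failure window. */
    void onFetchSuccess() {
        failedHostsSinceProgress.clear();
    }

    void onFetchFailure(String host) {
        failedHostsSinceProgress.add(host);
    }

    Verdict evaluate() {
        if (failedHostsSinceProgress.isEmpty()) {
            return Verdict.HEALTHY;
        }
        // Integer arithmetic avoids floating-point comparison at the boundary.
        boolean manyHostsFailing =
            failedHostsSinceProgress.size() * 100 >= totalHosts * HOST_FAILURE_PERCENT;
        // Failures spread across many hosts point at the consumer;
        // failures confined to a few hosts point at those sources.
        return manyHostsFailing ? Verdict.CONSUMER_SUSPECT : Verdict.SOURCE_SUSPECT;
    }
}
```

With 10 source hosts, a failure on a single host would leave the consumer 
alone and point at that source, while failures on two or more distinct hosts 
with no progress in between would cross the 20% mark in this sketch.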

Not covered in this:
- If the producer host gets restarted, the consumer could get a 404 error. 
This is handled in the same way as other types of read exceptions (e.g. 500 
internal server error, shuffle header mismatch, etc.). Ideally, it might be 
good to restart the producer as soon as possible on the AM side on observing a 
404 (instead of waiting for the retry cycle). This can be addressed in a 
separate ticket, as it does not cause any job hang currently.

[~sseth] - Please review when you find time.

> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
