[
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2882:
----------------------------------
Attachment: TEZ-2882.1.patch
Changes:
- Reduced abortFailureLimit from 30 to 15. This should be fine for detecting
read/connect issues, but it could be aggressive when the server returns a 500
internal server error (e.g. the server had disk issues but was able to read
the index file properly, then hit disk errors while streaming the actual
contents and ended up returning a 500 internal server error). In such cases,
reducing this value from 30 to 15 might cause failures to be declared a little
more aggressively. This should be ok, since after a 500 internal server error
there is hardly a chance of the server returning healthy output.
- Added ability to detect failure rates since the last progress. Task health is
checked based on this, which should improve the accuracy of deciding whether
the consumer or the source has to be restarted. Also, the consumer would be
restarted only when errors have happened across 20% of the hosts (e.g. failing
to fetch from 1 host while succeeding from others is likely that 1 host's
problem; failing to fetch from a large number of hosts is likely caused by
the consumer).
- Added a set of tests for this. Added a simple test for checking penalty as
well.
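To make the host-based health check above concrete, here is a minimal sketch of
the idea. It is not the code in TEZ-2882.1.patch: the class name, method names
and the integer threshold arithmetic are all hypothetical, and only the 20%
figure comes from the description above.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical illustration only; names and structure are assumptions,
 * not the classes touched by the patch.
 */
final class FetchHealthSketch {
    // Assumed from the "20% of the hosts" figure in the change description.
    static final int HOST_FAILURE_PERCENT = 20;

    private final int totalHosts;
    private final Set<String> failedHostsSinceProgress = new HashSet<>();
    private int failuresSinceProgress = 0;

    FetchHealthSketch(int totalHosts) {
        if (totalHosts <= 0) {
            throw new IllegalArgumentException("totalHosts must be positive");
        }
        this.totalHosts = totalHosts;
    }

    /** Record one failed fetch attempt against the given source host. */
    void onFetchFailure(String host) {
        failuresSinceProgress++;
        failedHostsSinceProgress.add(host);
    }

    /** A successful fetch counts as progress and resets the window. */
    void onProgress() {
        failuresSinceProgress = 0;
        failedHostsSinceProgress.clear();
    }

    int failuresSinceProgress() {
        return failuresSinceProgress;
    }

    /**
     * True when failures since the last progress span at least 20% of the
     * hosts, which points at the consumer rather than at a single source.
     * Integer arithmetic avoids floating-point rounding at the boundary.
     */
    boolean consumerLooksUnhealthy() {
        return failedHostsSinceProgress.size() * 100
                >= HOST_FAILURE_PERCENT * totalHosts;
    }
}
```

With 10 hosts, repeated failures against a single host would not implicate the
consumer in this sketch, while failures against 2 distinct hosts would.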
Not covered in this:
- In case the producer host gets restarted, the consumer could get a 404 error.
This is handled in the same way as other types of read exceptions (e.g. 500
internal server error, shuffle header mismatch etc). Ideally, it might be good
to restart the producer as soon as possible on the AM side on observing a 404
(instead of waiting for the retry cycle). This can be addressed in a separate
ticket, as it does not cause any job hang currently.
[~sseth] - Please review when you find time.
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)