[
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2882:
----------------------------------
Attachment: TEZ-2882.3.patch
Attaching the refactored patch based on the following.
In case of a fetch failure, here is the high-level set of actions that happen:
- Inform AM about the failures. In case the producer needs to be restarted, AM
can take care of it (e.g. based on thresholds, on the number of consumers
reporting issues, or on custom heuristics)
- Check whether the consumer needs to be restarted. If the overall shuffle phase
is not healthy, bail out. This is decided based on various parameters, such as:
-- whether the fetch failures of an individual attempt have exceeded the
threshold determined by abortFailureLimit (max 15 times). An abort limit of 15
should be more than sufficient given that the read/connect timeouts are set to
180 seconds by default (15 * 180 = 2700 seconds = 45 minutes). Earlier it was
30, which was way too high and might not be needed. checkReducerHealth() had a
corner case due to which the consumer never bailed out.
-- whether the overall fetching phase is healthy, based on the number of
failures reported relative to the number of successful completions. In some
cases this might not be sufficient, so it also considers the number of
failures that have happened since the last successful completion.
-- whether fetchers have made enough progress overall (e.g. it could be
expensive to kill the consumer if it has already made enough progress, so
restarting it might be delayed until all other conditions are met)
-- whether fetchers were stalled for a prolonged period of time since the last
update
-- whether it had difficulty fetching from multiple attempts
- Penalize the host
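
The consumer-side checks listed above can be sketched roughly as follows. This
is only an illustration, not the code in the patch: the class and method names,
the way the conditions are combined, and the constants marked "assumed" are all
hypothetical. Only the abort limit of 15, the 180 second timeout and the 50%
stall fraction come from this comment. The "failures since last successful
completion" input is left out to keep the sketch short.

```java
// Hypothetical sketch of a consumer health check; NOT the TEZ-2882 patch itself.
final class ShuffleHealthSketch {
    // Values quoted in this comment.
    static final int ABORT_FAILURE_LIMIT = 15;
    static final int READ_CONNECT_TIMEOUT_SECS = 180;
    static final double MAX_STALL_TIME_FRACTION = 0.5;
    // Assumed values, chosen only for illustration.
    static final double MAX_FAILED_FRACTION = 0.5;
    static final double MIN_PROGRESS_FRACTION = 0.5;
    static final int MAX_FAILED_UNIQUE_ATTEMPTS = 5;

    // Worst-case time spent on one source attempt before the abort limit is hit.
    static long worstCaseWaitSecs() {
        return (long) ABORT_FAILURE_LIMIT * READ_CONNECT_TIMEOUT_SECS;
    }

    static boolean shouldAbortConsumer(int maxFailuresForOneAttempt,
                                       int totalFailures,
                                       int completedInputs,
                                       int totalInputs,
                                       int failedUniqueAttempts,
                                       long stalledMillis,
                                       long shuffleRunMillis) {
        // A single source attempt that keeps failing is enough to bail out.
        if (maxFailuresForOneAttempt >= ABORT_FAILURE_LIMIT) {
            return true;
        }
        // Healthy: failures are a small share of all fetch outcomes so far.
        int outcomes = totalFailures + completedInputs;
        boolean healthy = outcomes == 0
            || (double) totalFailures / outcomes < MAX_FAILED_FRACTION;
        // Progressed enough: a large share of the inputs is already fetched.
        boolean progressedEnough = totalInputs > 0
            && (double) completedInputs / totalInputs >= MIN_PROGRESS_FRACTION;
        // Stalled: no update for a large share of the shuffle run time.
        boolean stalled = shuffleRunMillis > 0
            && (double) stalledMillis / shuffleRunMillis >= MAX_STALL_TIME_FRACTION;
        // Trouble fetching from several distinct source attempts.
        boolean manyAttemptsFailing =
            failedUniqueAttempts >= MAX_FAILED_UNIQUE_ATTEMPTS;
        return manyAttemptsFailing && !healthy && (!progressedEnough || stalled);
    }
}
```

In this sketch a consumer that has already fetched most of its inputs is only
aborted if it is also stalled, which matches the idea of delaying a restart
when enough progress has been made.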
Also, made TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_PERCENTAGE (20%),
TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST (3) and
TEZ_RUNTIME_SHUFFLE_MAX_STALL_TIME_PERCENTAGE (50%) configurable instead of
hard-coded. These weren't exposed earlier because we didn't want users to
change them.
failedShufflesSinceLastProgress should ideally be named
failedShufflesSinceLastCompletion. The intention is to reset it even when only
one of the attempts is able to report a successful fetch, so that fetchers are
given enough headroom before the consumer is killed. This does not affect any
other code path.
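
A minimal sketch of that reset behaviour, assuming a simple counter with a
hypothetical threshold (the class name, method names and threshold are
illustrative and are not taken from the patch):

```java
// Hypothetical sketch: counts fetch failures since the last successful fetch.
final class FetchFailureCounterSketch {
    private final int maxFailuresSinceLastCompletion; // assumed threshold
    private int failedShufflesSinceLastCompletion = 0;

    FetchFailureCounterSketch(int maxFailuresSinceLastCompletion) {
        this.maxFailuresSinceLastCompletion = maxFailuresSinceLastCompletion;
    }

    // Called whenever a fetch from any source attempt fails.
    void onFetchFailed() {
        failedShufflesSinceLastCompletion++;
    }

    // Called whenever any source attempt is fetched successfully;
    // a single success is enough to give the fetchers fresh headroom.
    void onFetchSucceeded() {
        failedShufflesSinceLastCompletion = 0;
    }

    int failuresSinceLastCompletion() {
        return failedShufflesSinceLastCompletion;
    }

    // True once the failures since the last success reach the threshold.
    boolean headroomExhausted() {
        return failedShufflesSinceLastCompletion >= maxFailuresSinceLastCompletion;
    }
}
```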
In a future jira, it would be good to consider the fetch cost/idle time (and
compare it with the stats from other resources). At the same time, the consumer
should not be restarted just because the fetch is slow (e.g. it is quite
possible that the source is serving lots of other requests and is therefore
slow in serving this fetcher's output; in such situations, it shouldn't be
killed).
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)