[ 
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2882:
----------------------------------
    Attachment: TEZ-2882.3.patch

Attaching the refactored patch based on the following.

In case of a fetch failure, the following high-level actions happen:
- Inform the AM about the failures. If the producer needs to be restarted, the 
AM can take care of it (e.g. based on thresholds, on the number of consumers 
reporting issues, or on custom heuristics).
- Check whether the consumer needs to be restarted. If the overall shuffle 
phase is not healthy, bail out. This is decided based on several parameters:
                -- Whether the fetch failures for an individual attempt have 
exceeded the threshold determined by abortFailureLimit (max 15 times). An 
abort limit of 15 should be more than sufficient given that the read/connect 
timeouts are set to 180 seconds by default (15 * 180 = 2700 seconds = 45 
minutes). Earlier it was 30, which was way too high and might not be needed. 
checkReducerHealth() had a corner case due to which the consumer never bailed 
out.
                -- Whether the overall fetching phase is healthy, based on the 
number of failures reported relative to the successful completions. In some 
cases this might not be sufficient, so it also considers the number of 
failures that have happened since the last successful completion.
                -- Whether fetchers have made enough progress overall (e.g. it 
could be expensive to kill the consumer if it has already made enough 
progress, so restarting it might be delayed until all other conditions are 
met).
                -- Whether fetchers were stalled for a prolonged period of 
time since the last update.
                -- Whether it had difficulty fetching from multiple attempts.
                
- Penalize the host
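
The consumer-side check listed above could be sketched roughly as follows. 
This is only an illustration of how the conditions might combine, not the code 
in the patch: the class, method, and parameter names are made up, the 
failures-since-last-completion check is left out, and the thresholds marked 
"assumed" are not stated anywhere in this comment.

```java
// Illustrative sketch only. Names are hypothetical; thresholds marked
// "assumed" are not taken from the patch or from the comment above.
final class ShuffleHealthSketch {
    static final int ABORT_FAILURE_LIMIT = 15;                 // stated above
    static final float MAX_STALL_TIME_FRACTION = 0.5f;         // 50%, stated above
    static final float MAX_FAILED_FETCH_FRACTION = 0.5f;       // assumed
    static final float MIN_REQUIRED_PROGRESS_FRACTION = 0.5f;  // assumed

    /** Worst-case seconds spent retrying one source before the abort limit trips. */
    static long worstCaseWaitSeconds(int abortLimit, int timeoutSeconds) {
        return (long) abortLimit * timeoutSeconds;
    }

    /** Returns true if the consumer should bail out and be restarted. */
    static boolean shouldRestartConsumer(
            int failuresForAttempt,      // fetch failures for one source attempt
            long totalFailures,          // all fetch failures reported so far
            int doneInputs,              // inputs fetched successfully
            int totalInputs,             // inputs this consumer must fetch
            long stallMillis,            // time since the last successful fetch
            long progressMillis,         // time during which fetches progressed
            int sourcesWithFailures,     // distinct sources that have failed
            int maxFailedUniqueFetches   // limit on distinct failing sources
    ) {
        // A single source attempt failing too often is enough to bail out.
        if (failuresForAttempt >= ABORT_FAILURE_LIMIT) {
            return true;
        }

        // Overall health: failures relative to failures plus completions.
        long attempts = totalFailures + doneInputs;
        boolean healthy = attempts == 0
                || ((float) totalFailures / attempts) < MAX_FAILED_FETCH_FRACTION;

        // Enough progress makes a restart expensive, so it counts against one.
        boolean progressedEnough = totalInputs > 0
                && ((float) doneInputs / totalInputs) >= MIN_REQUIRED_PROGRESS_FRACTION;

        // Stalled for a large fraction of the time spent making progress.
        boolean stalled = progressMillis > 0
                && ((float) stallMillis / progressMillis) >= MAX_STALL_TIME_FRACTION;

        // Trouble fetching from many sources, or from every remaining one.
        boolean manySourcesFailing = sourcesWithFailures >= maxFailedUniqueFetches
                || sourcesWithFailures == (totalInputs - doneInputs);

        return manySourcesFailing && !healthy && (!progressedEnough || stalled);
    }
}
```

With these assumptions, worstCaseWaitSeconds(15, 180) gives the 2700 seconds 
mentioned above, and a consumer that has fetched 9 of 10 inputs with a single 
failure is left alone.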

Also, made TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_PERCENTAGE (20%), 
TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST (3), and 
TEZ_RUNTIME_SHUFFLE_MAX_STALL_TIME_PERCENTAGE (50%) configurable instead of 
hard-coded. These weren't exposed earlier to discourage users from changing 
them.

failedShufflesSinceLastProgress should ideally be 
failedShufflesSinceLastCompletion. The intention is to reset it as soon as any 
one of the attempts reports a successful fetch, so that fetchers are given 
enough headroom before the consumer is killed. This does not affect any other 
code path.
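
A minimal sketch of the reset behaviour described above, assuming a simple 
counter. The class and method names are hypothetical, and the use of 
AtomicInteger is an assumption rather than what the patch does:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; class and method names are hypothetical.
final class FailureWindowSketch {
    // Counts fetch failures since the most recent successful fetch.
    private final AtomicInteger failedShufflesSinceLastCompletion = new AtomicInteger();

    /** Called whenever a fetch from any source attempt fails. */
    void onFetchFailed() {
        failedShufflesSinceLastCompletion.incrementAndGet();
    }

    /** Called whenever any attempt reports a successful fetch; resets the counter. */
    void onFetchSucceeded() {
        failedShufflesSinceLastCompletion.set(0);
    }

    int failuresSinceLastCompletion() {
        return failedShufflesSinceLastCompletion.get();
    }
}
```

Because a single successful fetch clears the counter, only an uninterrupted 
run of failures can push it towards a limit.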

In a future JIRA, it would be good to consider the fetch cost/idle time (and 
compare it with the stats from other resources). At the same time, the 
consumer should not be restarted just because a fetch is slow (e.g. it is 
quite possible that the source is serving lots of other requests and is 
therefore slow in serving this fetcher's output; in such situations, the 
consumer shouldn't be killed).


> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
