[ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953968#comment-14953968 ]

Bikas Saha commented on TEZ-2882:
---------------------------------

Some uber comments.
1) Could you please outline the current logic for failing a consumer task (as 
designed)? This would help anyone trying to follow this jira (and fetch 
failures in general) and is relevant to 2)
2) IMO, it's probably time we look holistically at 1) and see whether the whole 
thing needs a revamp that is easy to understand, configure and debug when 
things go wrong.
E.g. there are still a bunch of threshold values, either existing (e.g. 
abortFailureLimit = Math.max(15, numberOfInputs / 10); or 
MAX_ALLOWED_FAILED_FETCH_ATTEMPT_PERCENT) or added in this patch (e.g. 
PERCENTAGE_HOST_FETCH_FAILURE), that are hard coded, and that by itself makes 
them sensitive to cluster size and failure scope. E.g. a few failures in a 
large cluster may never trigger any of these because 0.2 of a 500 node cluster 
is 100 nodes. Perhaps making them configurable and tuning them up and down 
would alleviate issues without needing to add more logic.
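
To illustrate what I mean by configurable (purely a sketch; the config key 
names below are made up and are not existing Tez properties):
{code:java}
// Purely a sketch: the config key names are made up, not existing Tez properties.
import org.apache.hadoop.conf.Configuration;

public class FetchFailureConfig {
  // Hypothetical property names, only for illustration.
  public static final String ABORT_FAILURE_LIMIT_MIN =
      "tez.runtime.shuffle.abort-failure-limit.min";
  public static final String ABORT_FAILURE_LIMIT_FRACTION =
      "tez.runtime.shuffle.abort-failure-limit.fraction";

  private final int abortFailureLimitMin;
  private final float abortFailureLimitFraction;

  public FetchFailureConfig(Configuration conf) {
    // Defaults mirror the current hard coded behaviour: max(15, 10% of inputs).
    this.abortFailureLimitMin = conf.getInt(ABORT_FAILURE_LIMIT_MIN, 15);
    this.abortFailureLimitFraction =
        conf.getFloat(ABORT_FAILURE_LIMIT_FRACTION, 0.1f);
  }

  // Same shape as the existing abortFailureLimit computation, but tunable.
  public int abortFailureLimit(int numberOfInputs) {
    return Math.max(abortFailureLimitMin,
        (int) (numberOfInputs * abortFailureLimitFraction));
  }
}
{code}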

However, assuming the logic does need a refresh, it may be useful to look at it 
both at an uber level and at a micro level.
E.g. uber level - should the penalty period be proportional to the actual data 
that has already been fetched relative to the amount of time spent waiting and 
occupying a slot? E.g. a task that has read 100MB over 50 inputs and then keeps 
waiting for 10 mins.
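As a rough sketch of that idea (illustrative only, nothing here is from the 
patch):
{code:java}
// Illustrative only, not from the patch: scale the tolerated wait by how much
// of the shuffle has already completed, so a task that has fetched most of its
// inputs gets more slack than one that has fetched nothing.
public class ShuffleWaitBudget {
  private final long baseWaitMillis;

  public ShuffleWaitBudget(long baseWaitMillis) {
    this.baseWaitMillis = baseWaitMillis;
  }

  // E.g. 50 of 500 inputs done => 1.1x the base wait; nothing fetched yet =>
  // only the base wait before we consider the consumer stuck.
  public long maxWaitMillis(int completedInputs, int totalInputs) {
    double progress = totalInputs == 0 ? 0.0
        : (double) completedInputs / totalInputs;
    return (long) (baseWaitMillis * (1.0 + progress));
  }
}
{code}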
E.g. micro level - there are still 2 places in the code that try to abort the 
task: one just before checkAndInformAM(), which this patch is not touching, and 
the other in checkReducerHealth(), which this patch is touching. Perhaps we 
should look at consolidating all of this into a single place to simplify things.
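Something along these lines, where both call sites delegate to one policy 
object (names are made up, just to show the shape):
{code:java}
// Names are made up, just to show the shape of a single decision point that
// both existing abort paths could delegate to.
public final class ShuffleHealthPolicy {

  public enum Verdict { CONTINUE, REPORT_SOURCES, FAIL_CONSUMER }

  // Holder for the counters both call sites already track today.
  public static final class ShuffleStats {
    public int totalInputs;
    public int completedInputs;
    public int failedFetches;
    public long millisSinceLastProgress;
  }

  private final int abortFailureLimit;
  private final long maxStallMillis;

  public ShuffleHealthPolicy(int abortFailureLimit, long maxStallMillis) {
    this.abortFailureLimit = abortFailureLimit;
    this.maxStallMillis = maxStallMillis;
  }

  // Both the pre-checkAndInformAM() path and checkReducerHealth() would call
  // this instead of applying their own thresholds inline.
  public Verdict evaluate(ShuffleStats s) {
    if (s.failedFetches >= abortFailureLimit
        && s.millisSinceLastProgress > maxStallMillis) {
      return Verdict.FAIL_CONSUMER;
    }
    return s.failedFetches > 0 ? Verdict.REPORT_SOURCES : Verdict.CONTINUE;
  }
}
{code}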

On the patch itself, it looks like failedShufflesSinceLastProgress is reset on 
every successful read and not really since the last progress. That would mean 
that every successful fetch ends up resetting the clock, so newer completed 
outputs would keep resetting the clock for older completed outputs (which have 
a larger chance of being lost). Should failureCounts be tracked at a per-output 
level, independent of other things, so that we can use them in a consistent 
logical manner?
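E.g. a per-input tracker along these lines (illustrative only), where a success 
clears only that input's count instead of the shared clock:
{code:java}
// Illustrative only: per-input failure counts (per InputAttemptIdentifier in
// Tez terms), where a success only clears that input's count instead of the
// shared "since last progress" clock.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerInputFailureTracker<K> {
  private final Map<K, Integer> failuresByInput = new ConcurrentHashMap<>();

  // Returns the updated failure count for this input.
  public int recordFailure(K input) {
    return failuresByInput.merge(input, 1, Integer::sum);
  }

  // A success clears only this input; older still-missing inputs keep their
  // failure history.
  public void recordSuccess(K input) {
    failuresByInput.remove(input);
  }

  public int failures(K input) {
    return failuresByInput.getOrDefault(input, 0);
  }
}
{code}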
I am a little wary about the failed-across-nodes logic. Firstly, it is based on 
a new set of thresholds, so it may suffer from the same blindspot phenomena 
that the current set of thresholds is susceptible to, i.e. there can be a 
combination of node and failure counts that makes this hang indefinitely. 
Secondly, without understanding the network topology it may not be accurate to 
decide that failures from 3 sources imply that the consumer is bad. The 3 
sources could be behind a common rack switch, and restarting the consumer will 
not fix the situation. Finding correlated failures is a global activity that 
may be better off done at the AM than in the tasks. Tasks could have local 
heuristics, e.g. if I cannot read the new version of a previously lost output 
then perhaps I am the one that's bad.
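A sketch of that local heuristic (illustrative only; the class and method names 
are made up):
{code:java}
// Illustrative sketch of that local heuristic; class and method names are made
// up. If even the re-run attempt of an output I previously failed to fetch
// still fails, suspect the consumer (myself) rather than the sources.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SelfSuspicionHeuristic {
  // source task index -> attempt numbers that failed for this consumer
  private final Map<Integer, Set<Integer>> failedAttempts = new HashMap<>();

  // Returns true if this failure is for a newer attempt of a source whose
  // earlier attempt also failed here, i.e. the producer was re-run and this
  // consumer still cannot read it.
  public boolean recordFailureAndCheckSelf(int srcTaskIndex, int attemptNumber) {
    Set<Integer> attempts =
        failedAttempts.computeIfAbsent(srcTaskIndex, k -> new HashSet<>());
    boolean failedOlderAttemptToo =
        attempts.stream().anyMatch(a -> a < attemptNumber);
    attempts.add(attemptNumber);
    return failedOlderAttemptToo;
  }
}
{code}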

A different way to think about consumer failure would be to look at it from a 
fetchCost+idleTime tradeoff perspective. How much time/resource am I wasting 
compared to the resources I have already spent fetching partial data?
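In rough Java (illustrative only; wasteFactor would be a tunable knob):
{code:java}
// Illustrative only: declare the consumer stuck when the idle time wasted
// outweighs the fetch work already invested; wasteFactor would be a tunable
// knob (e.g. 2.0 = tolerate idling up to twice the time spent fetching).
public class FetchCostTradeoff {
  public static boolean shouldRestartConsumer(long fetchMillis, long idleMillis,
      double wasteFactor) {
    return idleMillis > fetchMillis * wasteFactor;
  }
}
{code}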

Thoughts?

> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
