[ https://issues.apache.org/jira/browse/HIVE-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gopal V resolved HIVE-6751.
---------------------------
Resolution: Invalid
This should be fixed in Tez.
> maxtaskfailures.per.node is set to too low a threshold
> ------------------------------------------------------
>
> Key: HIVE-6751
> URL: https://issues.apache.org/jira/browse/HIVE-6751
> Project: Hive
> Issue Type: Bug
> Reporter: Gopal V
>
> Node blacklisting leads to a task retry pattern that can consume cluster
> resources excessively for queries that will eventually fail anyway.
> For a query with a large stage on a 20-node cluster, a few failures can make
> the query go back and re-run stages multiple times until it has re-run the
> broken reducer 3 times.
> The same vertex failing 3 times on a node is no reason to throw away all the
> shuffle data already accumulated on that node.
> An alternative strategy is to kill a container after 3 tasks fail within it,
> because the error is occasionally caused by bugs triggered by container
> re-use (static variables, incomplete task cleanup, etc.), and the task will
> succeed if run on a fresh container.
> The node-failure threshold should be ~3x the number of containers on the
> node, given that containers are respawned after every 3rd failure.
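
The proposed policy could be sketched roughly as below. This is only an
illustration of the description above, not Tez or Hive code: the class
FailureTracker, its method names, and the returned action strings are all
hypothetical, and the number of containers per node is assumed to be known.

```python
from collections import defaultdict

# Hypothetical sketch; none of these names come from Tez or Hive.
# From the description: kill a container after 3 task failures within it.
MAX_FAILURES_PER_CONTAINER = 3


class FailureTracker:
    """Tracks task failures per container and per node."""

    def __init__(self, containers_per_node):
        # Assumption: every node runs the same, known number of containers.
        self.containers_per_node = containers_per_node
        self.container_failures = defaultdict(int)  # (node, container) -> failures
        self.node_failures = defaultdict(int)       # node -> total failures
        self.blacklisted = set()

    def node_threshold(self):
        # ~3x no-of-containers: each container gets killed once before the
        # node itself is treated as failed.
        return MAX_FAILURES_PER_CONTAINER * self.containers_per_node

    def record_failure(self, node, container):
        """Record one task failure and return the action to take."""
        self.node_failures[node] += 1
        self.container_failures[(node, container)] += 1

        if self.node_failures[node] >= self.node_threshold():
            self.blacklisted.add(node)
            return "blacklist_node"

        if self.container_failures[(node, container)] >= MAX_FAILURES_PER_CONTAINER:
            # Drop the count; a respawned container starts fresh.
            del self.container_failures[(node, container)]
            return "kill_container"

        return "retry"
```

With 2 containers per node, the node threshold in this sketch is 6: three
failures in one container only kill that container, and the node is
blacklisted when the sixth failure on the node arrives.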