Gopal V created HIVE-6751:
-----------------------------
Summary: maxtaskfailures.per.node is set to too low a threshold
Key: HIVE-6751
URL: https://issues.apache.org/jira/browse/HIVE-6751
Project: Hive
Issue Type: Bug
Reporter: Gopal V
Node blacklisting leads to a task retry pattern that can consume excessive
cluster resources on queries that will eventually fail anyway.
For a query with large stages on a 20-node cluster, a few failures can make the
query go back and re-run stages multiple times, until it eventually re-runs the
broken reducer 3 times.
The same vertex failing 3 times on a node is no reason to throw away all the
shuffle data already accumulated on that node.
An alternative strategy is to kill a container after 3 tasks fail within it,
because the error is occasionally caused by bugs triggered by container re-use
(static variables, incomplete task cleanup, etc.), and the task will succeed if
run on a fresh container.
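
A minimal sketch of the per-container strategy described above, written in
Python for brevity. The class name ContainerFailureTracker, the record_failure
method, and the container ids are hypothetical illustrations, not part of any
Hive or Tez API; only the limit of 3 comes from the description.

```python
from collections import defaultdict

# Assumed limit, taken from the "3 tasks fail within it" figure above.
MAX_FAILURES_PER_CONTAINER = 3


class ContainerFailureTracker:
    """Counts task failures per container and flags containers to retire."""

    def __init__(self, limit=MAX_FAILURES_PER_CONTAINER):
        self.limit = limit
        self.failures = defaultdict(int)

    def record_failure(self, container_id):
        """Record one task failure; return True if the container should be killed."""
        self.failures[container_id] += 1
        if self.failures[container_id] >= self.limit:
            # Drop the counter so a replacement container starts from zero.
            del self.failures[container_id]
            return True
        return False
```

The point of the design is that the failure count is tied to the container
rather than to the node, so shuffle data already on the node is left alone.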
The node-failure threshold should be roughly 3x the number of containers on the
node, given that containers are respawned after every 3rd failure.
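
The arithmetic behind that threshold can be sketched as follows. The function
name node_failure_threshold and the example of 8 containers per node are
assumptions made for illustration; the issue only gives the ~3x factor.

```python
def node_failure_threshold(containers_per_node, failures_per_container=3):
    """Task failures a node may absorb before it is treated as failed.

    Each container is respawned after `failures_per_container` failures, so
    the node is only suspect once every container has used up that budget.
    """
    return failures_per_container * containers_per_node


# Hypothetical node running 8 containers: 3 * 8 = 24 failures.
print(node_failure_threshold(8))
```

With these assumed numbers, a node would need 24 task failures before being
blacklisted, rather than a small fixed count that ignores container respawns.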