Gopal V created HIVE-6751:
-----------------------------

             Summary: maxtaskfailures.per.node is set to too low a threshold
                 Key: HIVE-6751
                 URL: https://issues.apache.org/jira/browse/HIVE-6751
             Project: Hive
          Issue Type: Bug
            Reporter: Gopal V


Node blacklisting leads to task-retry behaviour that can consume excessive 
cluster resources on queries which will eventually fail anyway.

For a query with a large stage on a 20 node cluster, a few failures can make 
the query go back and re-run earlier stages multiple times, until it has 
eventually re-run the broken reducer 3 times.

The same vertex failing 3 times on a node is no reason to throw away all the 
shuffle data already accumulated on that node.

An alternative strategy is to kill a container after 3 tasks fail within it, 
because the error is occasionally caused by bugs triggered by container re-use 
(static variables, incomplete task cleanup, etc.), and the task will succeed if 
run on a fresh container.

The threshold for treating a node as failed should be ~3x the number of 
containers on that node, given that containers are respawned after every 3rd 
failure.
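
To make the proposal concrete, here is a minimal sketch of the two-level 
policy described above. It is not Hive or Tez code: the class, method and enum 
names are all hypothetical, and it assumes the caller knows how many 
containers run on each node.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; none of these names exist in Hive or Tez. */
final class RetryPolicySketch {
    /** Assumed from the report: kill a container after 3 task failures in it. */
    static final int FAILURES_PER_CONTAINER = 3;

    enum Action { RETRY, KILL_CONTAINER, BLACKLIST_NODE }

    private final int containersPerNode;
    private final Map<String, Integer> failuresByContainer = new HashMap<>();
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    RetryPolicySketch(int containersPerNode) {
        this.containersPerNode = containersPerNode;
    }

    /** Node threshold proposed in the report: ~3x the containers on the node. */
    int nodeFailureThreshold() {
        return FAILURES_PER_CONTAINER * containersPerNode;
    }

    /** Records one task failure and returns what the scheduler should do. */
    Action onTaskFailure(String nodeId, String containerId) {
        int nodeFailures = failuresByNode.merge(nodeId, 1, Integer::sum);
        int containerFailures = failuresByContainer.merge(containerId, 1, Integer::sum);

        // Only give up on the node (and its shuffle data) at the higher threshold.
        if (nodeFailures >= nodeFailureThreshold()) {
            return Action.BLACKLIST_NODE;
        }
        // Otherwise recycle just the container; its replacement starts clean.
        if (containerFailures >= FAILURES_PER_CONTAINER) {
            failuresByContainer.remove(containerId);
            return Action.KILL_CONTAINER;
        }
        return Action.RETRY;
    }
}
```

With 4 containers per node, for example, this sketch would recycle a 
container on every 3rd failure inside it and blacklist the node only at the 
12th failure on that node.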



--
This message was sent by Atlassian JIRA
(v6.2#6252)
