[ 
https://issues.apache.org/jira/browse/HIVE-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V resolved HIVE-6751.
---------------------------

    Resolution: Invalid

This should be fixed in Tez rather than in Hive.

> maxtaskfailures.per.node is set to too low a threshold
> ------------------------------------------------------
>
>                 Key: HIVE-6751
>                 URL: https://issues.apache.org/jira/browse/HIVE-6751
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gopal V
>
> Node blacklisting produces a task retry cycle that can consume excessive
> cluster resources on queries that will eventually fail anyway.
> For a query with a large stage on a 20-node cluster, a few failures can make
> the query go back and re-run stages multiple times, until it has eventually
> re-run the broken reducer 3 times.
> The same vertex failing 3 times on a node is no reason to throw away all the
> shuffle data already accumulated on that node.
> An alternative strategy is to kill a container after 3 tasks fail within it,
> because the error is occasionally caused by bugs that container re-use
> triggers (static variables, incomplete task cleanup, etc.), and the task will
> succeed if run on a fresh container.
> The node-failure threshold should be ~3x the number of containers on the node,
> given that containers are respawned after every 3rd failure.
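
The policy proposed in the description can be sketched roughly as follows.
This is only an illustration of the arithmetic, not Hive or Tez code: the
names FailureTracker, record_failure and node_failure_threshold are
hypothetical, and the sketch assumes a fixed number of containers per node
and that a replacement container may reuse the id of the one it replaces.

```python
from collections import defaultdict

# From the description: kill a container after 3 task failures within it.
CONTAINER_FAILURE_LIMIT = 3


def node_failure_threshold(containers_per_node, container_limit=CONTAINER_FAILURE_LIMIT):
    """Node-level threshold of ~3x the number of containers on the node."""
    return container_limit * containers_per_node


class FailureTracker:
    """Hypothetical bookkeeping for per-container and per-node task failures."""

    def __init__(self, containers_per_node):
        self.node_threshold = node_failure_threshold(containers_per_node)
        self.container_failures = defaultdict(int)
        self.node_failures = defaultdict(int)

    def record_failure(self, node, container):
        """Record one task failure and return the action this sketch would take."""
        key = (node, container)
        self.container_failures[key] += 1
        self.node_failures[node] += 1

        # Only give up on the node once failures reach the node-level threshold.
        if self.node_failures[node] >= self.node_threshold:
            return "blacklist_node"

        # Otherwise replace just the container; the node and its shuffle data stay.
        if self.container_failures[key] >= CONTAINER_FAILURE_LIMIT:
            del self.container_failures[key]  # the fresh container starts at zero
            return "kill_container"

        return "retry"
```

With 4 containers per node, node_failure_threshold(4) is 12, so in this
sketch a node is blacklisted on its 12th task failure, while each container
is replaced on its 3rd.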



--
This message was sent by Atlassian JIRA
(v6.2#6252)
