[jira] [Resolved] (HIVE-6751) maxtaskfailures.per.node is set to too low a threshold

Gopal V (JIRA) Tue, 25 Mar 2014 19:02:26 -0700

     [ https://issues.apache.org/jira/browse/HIVE-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Gopal V resolved HIVE-6751.
---------------------------

    Resolution: Invalid

This should be a Tez fix.

> maxtaskfailures.per.node is set to too low a threshold
> ------------------------------------------------------
>
>                 Key: HIVE-6751
>                 URL: https://issues.apache.org/jira/browse/HIVE-6751
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gopal V
>
> Node blacklisting results in a task retry system that can consume
> cluster resources excessively on queries that will eventually fail.
> For a large stage query on a 20-node cluster, a few failures can send the
> query back to re-run stages multiple times, until it eventually re-runs
> the broken reducer 3 times.
> The same vertex failing 3 times on a node is no reason to throw away all the
> shuffle data already accumulated on that node.
> An alternative strategy is to kill a container after 3 tasks fail within it,
> because the error is occasionally caused by bugs triggered by container
> re-use (static variables, incomplete task cleanup, etc.), and the task will
> succeed if run on a fresh container.
> The threshold for a node failure should be ~3x the number of containers,
> when containers are respawned after every 3rd failure.
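
The proposal in the description can be sketched as a small policy object.
This is only an illustration, not Tez or Hive code: the class name, method
names, and returned action strings are hypothetical, and it assumes that
"no-of-containers" means the number of containers running on one node.

```python
from collections import defaultdict


class FailurePolicy:
    """Hypothetical sketch of the proposed strategy (not a Tez API)."""

    def __init__(self, containers_per_node, container_limit=3):
        # Failures tolerated inside one container before it is killed.
        self.container_limit = container_limit
        # Assumption: node threshold is ~3x the containers on that node.
        self.node_limit = container_limit * containers_per_node
        self.container_failures = defaultdict(int)
        self.node_failures = defaultdict(int)

    def record_failure(self, node, container):
        """Count one task failure and return the action to take."""
        key = (node, container)
        self.container_failures[key] += 1
        self.node_failures[node] += 1

        if self.node_failures[node] >= self.node_limit:
            # Only now is the node itself treated as bad.
            return "blacklist_node"
        if self.container_failures[key] >= self.container_limit:
            # Respawn: the replacement container starts with a clean count,
            # and shuffle data already on the node is left alone.
            del self.container_failures[key]
            return "kill_container"
        return "retry"


policy = FailurePolicy(containers_per_node=4)
actions = [policy.record_failure("node-1", "c0") for _ in range(12)]
print(actions)
```

With 4 containers per node, every 3rd failure in a container replaces that
container, and the node is blacklisted only at the 12th failure.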



--
This message was sent by Atlassian JIRA
(v6.2#6252)
