[jira] [Commented] (TEZ-1069) Support ability to re-size a task attempt when previous attempts fail due to resource constraints

Hitesh Shah (JIRA) Wed, 14 Jan 2015 08:36:46 -0800

    [ https://issues.apache.org/jira/browse/TEZ-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277179#comment-14277179 ]


Hitesh Shah commented on TEZ-1069:
----------------------------------

I am not sure that is the approach I would have taken. My thinking was more
along the lines of querying the VertexManager so that it can modify the task
specifications in such cases. Changing the resource is not enough. One would
also need to change the java opts. For the latter, we would need to write a
java opts parser.
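
To make the java opts point concrete, here is a minimal sketch of what such a
parser might look like. It is not part of the attached patch or of Tez itself.
The class name JavaOptsResizer and the method scaleMaxHeap are hypothetical,
and the sketch assumes that only -Xmx needs to be rewritten. A real
implementation would probably also have to deal with -Xms, direct memory
settings and similar flags.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Tez class.
public final class JavaOptsResizer {
    // Matches -Xmx<digits> with an optional k/m/g unit, ending at whitespace or end of string.
    private static final Pattern XMX =
        Pattern.compile("-Xmx(\\d+)([kKmMgG]?)(?=\\s|$)");

    private JavaOptsResizer() {}

    // Returns javaOpts with every -Xmx value multiplied by factor,
    // rounded up to whole megabytes. Opts without -Xmx come back unchanged.
    public static String scaleMaxHeap(String javaOpts, double factor) {
        Matcher m = XMX.matcher(javaOpts);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            long bytes = toBytes(Long.parseLong(m.group(1)), m.group(2));
            long scaledMb =
                Math.max(1L, (long) Math.ceil(bytes * factor / (1024.0 * 1024.0)));
            m.appendReplacement(out,
                Matcher.quoteReplacement("-Xmx" + scaledMb + "m"));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a JVM size value with an optional unit suffix to bytes.
    private static long toBytes(long value, String unit) {
        switch (unit.toLowerCase(Locale.ROOT)) {
            case "k": return value * 1024L;
            case "m": return value * 1024L * 1024L;
            case "g": return value * 1024L * 1024L * 1024L;
            default:  return value; // no suffix means plain bytes
        }
    }
}
```

With this sketch, scaleMaxHeap("-server -Xmx1024m", 1.5) returns
"-server -Xmx1536m". The container resource would still have to be raised
separately so that the larger heap actually fits.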

Isn't it better to set up hooks so that a VertexManager can resize the task
on OOM failures? Furthermore, a lot of OOM failures are due to data skew, where
one task is affected but the rest are not.
 
Last question: when should this increase be done? Should it be done on each
attempt failure or only on the last attempt?

> Support ability to re-size a task attempt when previous attempts fail due to 
> resource constraints
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1069
>                 URL: https://issues.apache.org/jira/browse/TEZ-1069
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1069-1.patch
>
>
> Consider a case where attempts for the final stage in a long DAG fail due to
> out of memory. In such a scenario, the framework (or the base vertex
> manager) should be able to change the task specifications on the fly to
> trigger a re-run with modified specs.
> Changes could include both java opts changes and container resource
> requirements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
