[jira] [Commented] (TEZ-1069) Support ability to re-size a task attempt when previous attempts fail due to resource constraints

Hitesh Shah (JIRA) Tue, 27 Jan 2015 15:07:51 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294381#comment-14294381
 ]


Hitesh Shah commented on TEZ-1069:
----------------------------------

Comments: 

- Why is "maxOOMFailedTaskAttempts" needed? I am not sure why it needs to be 
reset each time, i.e. "maxOOMFailedTaskAttempts = numOOMfailedAttempts;". 
- How are you ensuring that the new memory size does not exceed YARN's max 
limits? 
- How are the java opts being changed, especially in the case where the user 
has specified their own Xmx, etc.? This will need to involve the VertexManager, 
as user settings cannot simply be overridden. 
- Also, changes are needed to log this for history. 
- It would be good to add tests. 
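
To make the second and third points concrete, here is a minimal sketch of the 
kind of checks I have in mind. It is not taken from the patch: the class, the 
method names, the growth factor and the heap fraction are all hypothetical, and 
the cap is assumed to come from the cluster's maximum allocation (e.g. 
yarn.scheduler.maximum-allocation-mb). 

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tez or of the TEZ-1069 patch. */
final class TaskResizeSketch {

    // Detects a user-supplied max-heap flag such as -Xmx1024m or -Xmx2g.
    private static final Pattern XMX =
        Pattern.compile("(^|\\s)-Xmx\\d+[kKmMgG]?(\\s|$)");

    /** Grows the previous request by growthFactor, but never beyond the cluster maximum. */
    static int nextMemoryMb(int previousMb, double growthFactor, int maxAllocationMb) {
        long grown = (long) Math.ceil(previousMb * growthFactor);
        return (int) Math.min(grown, (long) maxAllocationMb);
    }

    /** Re-sizing can only help while the previous request is still below the cap. */
    static boolean canGrow(int previousMb, int maxAllocationMb) {
        return previousMb < maxAllocationMb;
    }

    /**
     * Leaves a user-specified -Xmx untouched; otherwise appends one derived
     * from the container size (heapFraction is an assumed tuning knob).
     */
    static String javaOptsFor(String userOpts, int containerMb, double heapFraction) {
        String opts = userOpts == null ? "" : userOpts.trim();
        if (XMX.matcher(opts).find()) {
            return opts; // user setting wins; do not override it silently
        }
        String xmx = "-Xmx" + (int) (containerMb * heapFraction) + "m";
        return opts.isEmpty() ? xmx : opts + " " + xmx;
    }

    public static void main(String[] args) {
        int next = nextMemoryMb(6144, 1.5, 8192);
        System.out.println("next container size (MB): " + next);
        System.out.println("java opts: " + javaOptsFor("-XX:+UseG1GC", next, 0.75));
    }
}
```

Whether a user-specified Xmx should be left alone, as in this sketch, or 
adjusted by the VertexManager is exactly the open question in the third point. 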

It might be good to split this out into 2 jiras: a new jira that tracks the 
fact that containers are getting killed due to OOM and updates the appropriate 
termination cause, while this jira figures out how to re-size correctly. 




 



> Support ability to re-size a task attempt when previous attempts fail due to 
> resource constraints
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1069
>                 URL: https://issues.apache.org/jira/browse/TEZ-1069
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1069-1.patch
>
>
> Consider a case where attempts for the final stage in a long DAG fail due to 
> out-of-memory errors. In such a scenario, the framework (possibly via the 
> base vertex manager) should be able to change the task specifications on the 
> fly to trigger a re-run with modified specs. 
> Changes could include both java opts changes and container resource 
> requirements. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
