[jira] [Commented] (TEZ-1069) Support ability to re-size a task attempt when previous attempts fail due to resource constraints

Jeff Zhang (JIRA) Wed, 14 Jan 2015 05:24:01 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276901#comment-14276901
 ]


Jeff Zhang commented on TEZ-1069:
---------------------------------

Yes, [~hitesh]

I have an initial patch that works. Here's the main flow:

* Identify whether the TaskAttempt failed due to OOM. There are two ways:
** ContainerExitStatus
** TaskAttemptCompleteEvent through the heartbeat (the OOM exception may be 
caught and passed through the heartbeat)
* Track the number of OOM-failed task attempts for each task, and compute the 
maximum across the tasks of this vertex. 
* Update the Resource of the vertex and all its tasks based on the max number 
of OOM-failed task attempts: pow(1 + increase_percent_per_OOM_failed_attempt, 
max_failed_attempt)
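
The bookkeeping and the resize formula in the flow above can be sketched 
roughly as follows. This is only an illustration, not the patch itself: the 
class name, the method names, the use of a task index as the key, and the 
choice to apply pow(...) as a multiplier on the originally requested memory 
are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Tez code base or the patch. */
class OomResizeSketch {
    // Assumed config value: fractional increase per OOM-failed attempt (0.25 means +25%).
    private final double increasePerOomFailedAttempt;
    // Number of OOM-failed attempts per task (keyed by task index) within one vertex.
    private final Map<Integer, Integer> oomFailuresByTask = new HashMap<>();

    OomResizeSketch(double increasePerOomFailedAttempt) {
        this.increasePerOomFailedAttempt = increasePerOomFailedAttempt;
    }

    /** Record one OOM-failed attempt for the given task. */
    void recordOomFailure(int taskIndex) {
        oomFailuresByTask.merge(taskIndex, 1, Integer::sum);
    }

    /** Max OOM-failed attempt count over all tasks of the vertex (0 if none failed). */
    int maxOomFailures() {
        int max = 0;
        for (int count : oomFailuresByTask.values()) {
            max = Math.max(max, count);
        }
        return max;
    }

    /** Memory to request for the vertex and all its tasks, assuming pow(...) is a multiplier. */
    int resizedMemoryMb(int originalMemoryMb) {
        double factor = Math.pow(1.0 + increasePerOomFailedAttempt, maxOomFailures());
        return (int) Math.round(originalMemoryMb * factor);
    }
}
```

For example, with an increase of 0.25 and a worst-case task that has had two 
OOM-failed attempts, a 1024 MB request would grow to 1600 MB under these 
assumptions. Because the factor is always computed from the original size, 
the growth stays tied to the max failure count and does not compound on each 
update.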

For task attempts that are in START_WAIT (being scheduled by 
TaskSchedulerService), I haven't changed anything for now. This may be the most 
complicated part, if it is required. 

 




> Support ability to re-size a task attempt when previous attempts fail due to 
> resource constraints
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1069
>                 URL: https://issues.apache.org/jira/browse/TEZ-1069
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>
> Consider a case where attempts for the final stage in a long DAG fail due to 
> running out of memory. In such a scenario, the framework (or via the base 
> vertex manager) should be able to change the task specifications on the fly 
> to trigger a re-run with modified specs. 
> Changes could include both java opts changes and container resource 
> requirements. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
