[
https://issues.apache.org/jira/browse/TEZ-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276901#comment-14276901
]
Jeff Zhang commented on TEZ-1069:
---------------------------------
Yes, [~hitesh]
I have an initial patch that works. Here's the main flow:
* Identify whether the TaskAttempt failed due to OOM. There are 2 ways:
** ContainerExitStatus
** TaskAttemptCompleteEvent through heartbeat ( OOM exception may be caught
and passed through heartbeat )
* Track the number of OOM-failed task attempts for each task, and
calculate the maximum across all tasks of the vertex.
* Update the Resource of the vertex and all its tasks based on the max number of
OOM-failed task attempts, using the factor
pow(1 + increase_percent_per_OOM_failed_attempt, max_failed_attempt)
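
A minimal sketch of the flow above, not the actual patch. The class and method names are hypothetical, and it assumes the Resource is reduced to a memory value in MB and that increase_percent_per_OOM_failed_attempt is given as a fraction (0.25 for 25%). The two exit-status constants mirror YARN's ContainerExitStatus.KILLED_EXCEEDED_VMEM and KILLED_EXCEEDED_PMEM; treating exactly these as "OOM" is an assumption, since the comment does not say which statuses the patch checks.

```java
import java.util.Map;

/** Hypothetical helper illustrating the resize flow; not part of the Tez API. */
final class OomResizeSketch {

    // Local copies of the values defined in YARN's ContainerExitStatus.
    static final int KILLED_EXCEEDED_VMEM = -103;
    static final int KILLED_EXCEEDED_PMEM = -104;

    /** Assumption: a container killed for exceeding memory limits counts as OOM. */
    static boolean isOomExit(int containerExitStatus) {
        return containerExitStatus == KILLED_EXCEEDED_VMEM
            || containerExitStatus == KILLED_EXCEEDED_PMEM;
    }

    /** Maximum number of OOM-failed attempts over all tasks of one vertex. */
    static int maxOomFailedAttempts(Map<Integer, Integer> oomFailuresByTaskIndex) {
        int max = 0;
        for (int count : oomFailuresByTaskIndex.values()) {
            max = Math.max(max, count);
        }
        return max;
    }

    /** Scales memory by pow(1 + increasePerOomFailure, maxFailedAttempts). */
    static int scaledMemoryMb(int baseMemoryMb,
                              double increasePerOomFailure,
                              int maxFailedAttempts) {
        double factor = Math.pow(1.0 + increasePerOomFailure, maxFailedAttempts);
        // Round to the nearest MB; a real scheduler may further normalize this.
        return (int) Math.round(baseMemoryMb * factor);
    }
}
```

For example, with a base of 1024 MB, an increase of 0.25 per failure, and a maximum of 2 OOM-failed attempts in the vertex, the factor is 1.25^2 = 1.5625, so every task of the vertex would request 1600 MB.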
For a task attempt that is already in START_WAIT (being scheduled by
TaskSchedulerService), I haven't changed anything for now. This may be the most
complicated part if it is required.
> Support ability to re-size a task attempt when previous attempts fail due to
> resource constraints
> -------------------------------------------------------------------------------------------------
>
> Key: TEZ-1069
> URL: https://issues.apache.org/jira/browse/TEZ-1069
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
>
> Consider a case where attempts for the final stage in a long DAG fail due to
> out of memory. In such a scenario, the framework ( or via the base vertex
> manager ) should be able to change the task specifications on the fly to
> trigger a re-run with modified specs.
> Changes could be both java opts changes as well as container resource
> requirements.