[ https://issues.apache.org/jira/browse/TEZ-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277179#comment-14277179 ]
Hitesh Shah commented on TEZ-1069:
----------------------------------
I am not sure that is the approach I would have taken. My thinking was more
along the lines of querying the VertexManager and allowing it to modify the
task specifications in such cases. Changing the resource is not enough. One
would also need to change the java opts. For the latter, we would need to
write a java opts parser.
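
To make the java opts point concrete, here is a minimal sketch of what such a
parser might do for the heap flag alone. Everything in it is an assumption for
illustration: JavaOptsResizer, parseMaxHeapBytes and scaleMaxHeap are
hypothetical names, not existing Tez APIs, and only the -Xmx<n>[k|m|g] form is
handled (no -XX:MaxHeapSize, no -Xms).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tez: reads and rewrites -Xmx in a java opts string.
final class JavaOptsResizer {
    // Matches -Xmx<digits><optional unit>, e.g. -Xmx1024m or -Xmx2G.
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([kKmMgG]?)");

    private JavaOptsResizer() {}

    // Converts a unit suffix to its byte multiplier; an empty suffix means bytes.
    private static long unitMultiplier(String unit) {
        switch (unit.toLowerCase()) {
            case "k": return 1024L;
            case "m": return 1024L * 1024L;
            case "g": return 1024L * 1024L * 1024L;
            default:  return 1L;
        }
    }

    // Returns the max heap in bytes, or -1 if no -Xmx flag is present.
    // If the flag appears more than once, this sketch uses the last one.
    static long parseMaxHeapBytes(String javaOpts) {
        Matcher m = XMX.matcher(javaOpts);
        long result = -1;
        while (m.find()) {
            result = Long.parseLong(m.group(1)) * unitMultiplier(m.group(2));
        }
        return result;
    }

    // Returns java opts with -Xmx multiplied by factor (rounded up to whole MB).
    // Opts without an -Xmx flag are returned unchanged.
    static String scaleMaxHeap(String javaOpts, double factor) {
        long bytes = parseMaxHeapBytes(javaOpts);
        if (bytes < 0) {
            return javaOpts;
        }
        long newMb = (long) Math.ceil(bytes * factor / (1024.0 * 1024.0));
        // Drop every existing -Xmx flag, normalize whitespace, append the new one.
        String stripped = XMX.matcher(javaOpts).replaceAll("").trim().replaceAll("\\s+", " ");
        return (stripped.isEmpty() ? "" : stripped + " ") + "-Xmx" + newMb + "m";
    }
}
```

With this sketch, scaleMaxHeap("-Xmx1024m -XX:+UseG1GC", 1.5) yields
"-XX:+UseG1GC -Xmx1536m". The container resource would still have to be raised
to match, which is why changing only one of the two is not enough.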
Isn't it better to set up hooks so that, on OOM failures, a VertexManager can
resize the task? Furthermore, a lot of OOM failures are due to data skew, where
one task is affected but the rest are not.
Last question: when should this increase be done? Should it be done on each
attempt failure or only on the last attempt?
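
The two timing options in that last question can be written down as a small
policy check. This is only a sketch to frame the discussion: ResizePolicySketch,
Mode and shouldResizeNext are hypothetical names, and it assumes the framework
can tell that the most recent failure was an OOM.

```java
// Hypothetical policy sketch, not part of Tez.
final class ResizePolicySketch {
    // EVERY_FAILURE: resize after each OOM failure.
    // LAST_ATTEMPT_ONLY: resize only when the next attempt is the final one allowed.
    enum Mode { EVERY_FAILURE, LAST_ATTEMPT_ONLY }

    private ResizePolicySketch() {}

    // failedAttempts: attempts of this task that have already failed (>= 1).
    // maxAttempts: total attempts allowed for the task.
    // lastFailureWasOom: whether the most recent failure was an out-of-memory error.
    static boolean shouldResizeNext(Mode mode, int failedAttempts, int maxAttempts,
                                    boolean lastFailureWasOom) {
        // Nothing to resize if the failure was not an OOM or no attempts remain.
        if (!lastFailureWasOom || failedAttempts >= maxAttempts) {
            return false;
        }
        boolean nextIsLast = failedAttempts == maxAttempts - 1;
        return mode == Mode.EVERY_FAILURE || nextIsLast;
    }
}
```

Because the decision is made per task, either mode would only touch the one
skewed task that actually hit the OOM and leave the rest of the vertex alone.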
> Support ability to re-size a task attempt when previous attempts fail due to
> resource constraints
> -------------------------------------------------------------------------------------------------
>
> Key: TEZ-1069
> URL: https://issues.apache.org/jira/browse/TEZ-1069
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Attachments: TEZ-1069-1.patch
>
>
> Consider a case where attempts for the final stage in a long DAG fail due to
> running out of memory. In such a scenario, the framework (or the base vertex
> manager) should be able to change the task specifications on the fly to
> trigger a re-run with modified specs.
> Changes could be both java opts changes as well as container resource
> requirements.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)