[
https://issues.apache.org/jira/browse/SLIDER-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099041#comment-16099041
]
Billie Rinaldi commented on SLIDER-1233:
----------------------------------------
[~gsaha], thanks for taking a look. There are separate exit codes for exceeding
memory limits (ContainerExitStatus.KILLED_EXCEEDED_PMEM and
ContainerExitStatus.KILLED_EXCEEDED_VMEM) and these are translated into
ContainerOutcome.Failed_limits_exceeded rather than ContainerOutcome.Completed.
I am not sure of all the cases where KILLED_BY_RESOURCEMANAGER may occur, but
it appears to occur when an NM is decommissioned or resynced.
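
To make the mapping described in this comment concrete, here is a minimal, self-contained sketch. It is not the Slider source: the class name, the Outcome enum, and fromExitStatus are hypothetical stand-ins, and the integer constants are copied locally (assumed to match YARN's ContainerExitStatus values) so the example compiles without Hadoop on the classpath. The default branch is also an assumption.

```java
public class ExitStatusMappingSketch {
    // Local copies of exit codes, assumed to mirror
    // org.apache.hadoop.yarn.api.records.ContainerExitStatus.
    static final int KILLED_EXCEEDED_VMEM = -103;
    static final int KILLED_EXCEEDED_PMEM = -104;
    static final int KILLED_BY_APPMASTER = -105;
    static final int KILLED_BY_RESOURCEMANAGER = -106;
    static final int KILLED_AFTER_APP_COMPLETION = -107;

    // Hypothetical stand-in for Slider's ContainerOutcome.
    enum Outcome { COMPLETED, FAILED, FAILED_LIMITS_EXCEEDED }

    static Outcome fromExitStatus(int exitStatus) {
        switch (exitStatus) {
            // Memory limit violations get their own outcome.
            case KILLED_EXCEEDED_PMEM:
            case KILLED_EXCEEDED_VMEM:
                return Outcome.FAILED_LIMITS_EXCEEDED;
            // Kills by the AM, by the RM, or after app completion
            // are treated as a plain completion.
            case KILLED_BY_APPMASTER:
            case KILLED_BY_RESOURCEMANAGER:
            case KILLED_AFTER_APP_COMPLETION:
                return Outcome.COMPLETED;
            // Assumption: exit code 0 is a completion, anything else a failure.
            default:
                return exitStatus == 0 ? Outcome.COMPLETED : Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromExitStatus(KILLED_EXCEEDED_PMEM));
        System.out.println(fromExitStatus(KILLED_BY_RESOURCEMANAGER));
    }
}
```

With this mapping, a container killed for exceeding its memory limit is kept distinct from one killed by the RM, which is the distinction the comment relies on.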
> Lost nodes should not contribute to container failures
> ------------------------------------------------------
>
> Key: SLIDER-1233
> URL: https://issues.apache.org/jira/browse/SLIDER-1233
> Project: Slider
> Issue Type: Bug
> Components: core
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1233.001.patch
>
>
> If a container completes due to an NM being lost, we should not count this
> towards container failures that may eventually cause the AM to fail the
> application. We are already using a ContainerOutcome of Completed (rather
> than Failed) for this type of container exit, so we just need to change the
> failure counting in that case. Other failure types associated with Completed
> are killed by the AM, killed by the RM, and killed after app completion, none
> of which need to contribute to container failures.
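
The counting change proposed in the description above could look roughly like the following. This is only an illustrative sketch, not the attached patch: FailureCounterSketch, its Outcome enum, and the threshold check are hypothetical names and assumptions chosen to show the idea that a Completed outcome is recorded without incrementing the failure count.

```java
public class FailureCounterSketch {
    // Hypothetical stand-in for Slider's ContainerOutcome.
    enum Outcome { COMPLETED, FAILED, FAILED_LIMITS_EXCEEDED }

    private int completed;
    private int failed;

    // Record a finished container. Only non-Completed outcomes
    // count toward the failure total.
    void onContainerFinished(Outcome outcome) {
        if (outcome == Outcome.COMPLETED) {
            completed++;
        } else {
            failed++;
        }
    }

    int getCompleted() { return completed; }

    int getFailed() { return failed; }

    // Assumption: the AM fails the application once the number of
    // failures exceeds a configured threshold.
    boolean shouldFailApplication(int failureThreshold) {
        return failed > failureThreshold;
    }

    public static void main(String[] args) {
        FailureCounterSketch counter = new FailureCounterSketch();
        // Three containers exit because their node was lost (Completed),
        // one exits for exceeding a memory limit.
        counter.onContainerFinished(Outcome.COMPLETED);
        counter.onContainerFinished(Outcome.COMPLETED);
        counter.onContainerFinished(Outcome.COMPLETED);
        counter.onContainerFinished(Outcome.FAILED_LIMITS_EXCEEDED);
        System.out.println("failed=" + counter.getFailed()
                + " shouldFail=" + counter.shouldFailApplication(2));
    }
}
```

Under these assumptions, any number of lost-node exits leaves the failure count untouched, so they alone can never push the application over the threshold.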