[ https://issues.apache.org/jira/browse/HADOOP-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704953#comment-14704953 ]
Hudson commented on HADOOP-12317:
---------------------------------
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2239 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2239/])
HADOOP-12317. Applications fail on NM restart on some linux distro because NM
container recovery declares AM container as LOST (adhoot via rkanter) (rkanter:
rev 1e06299df82b98795124fe8a33578c111e744ff4)
* hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java
* hadoop-common-project/hadoop-common/CHANGES.txt
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
> Applications fail on NM restart on some linux distro because NM container
> recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-12317
> URL: https://issues.apache.org/jira/browse/HADOOP-12317
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Anubhav Dhoot
> Assignee: Anubhav Dhoot
> Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4046.002.patch, YARN-4046.002.patch,
> YARN-4096.001.patch
>
>
> On a Debian machine we have seen NodeManager recovery of containers fail
> because the signal syntax for a process group may not work. Checking whether
> the process is alive fails during container recovery, which causes the
> container to be declared LOST (exit code 154) on a NodeManager restart.
> The application then fails with the error below, and the attempts are not
> retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt
> recovered after RM restartAM Container for
> appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}
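
For illustration only, here is a minimal sketch of the kind of liveness check the description refers to. It is not the patch committed to Shell.java. It assumes the failure comes from some `kill` implementations parsing a negative process-group argument (`kill -0 -<pgid>`) as an option, and that placing `--` before it ends option parsing. The class and method names (`ProcessGroupCheck`, `checkAliveCommand`, `isAlive`) are hypothetical.

```java
import java.io.IOException;

// Hypothetical helper, not part of Hadoop: builds and runs a "kill -0" liveness probe.
final class ProcessGroupCheck {

    // Build the argument vector for a liveness probe.
    // For a process group the target is the negated id; "--" is placed before it
    // so that a kill implementation does not treat "-<pgid>" as an option
    // (assumption: this is the syntax problem described in the issue).
    static String[] checkAliveCommand(String pid, boolean processGroup) {
        if (processGroup) {
            return new String[] {"kill", "-0", "--", "-" + pid};
        }
        return new String[] {"kill", "-0", pid};
    }

    // Run the probe. Exit status 0 means the signal could be delivered,
    // i.e. the process (or at least one member of the group) exists.
    static boolean isAlive(String pid, boolean processGroup)
            throws IOException, InterruptedException {
        Process probe = new ProcessBuilder(checkAliveCommand(pid, processGroup))
                .redirectErrorStream(true)
                .start();
        return probe.waitFor() == 0;
    }
}
```

Under these assumptions, a recovery path would call `ProcessGroupCheck.isAlive(pid, true)` for a container launched in its own session and treat a non-zero exit status as "not running". If the probe command itself is rejected as malformed, a live container would look dead, which matches the LOST outcome reported above.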
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)