[
https://issues.apache.org/jira/browse/HADOOP-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908881#comment-14908881
]
Wangda Tan commented on HADOOP-12441:
-------------------------------------
The root cause of this issue is:
Popular Linux distributions provide two different {{kill}} commands: one is a
bash built-in, and the other is a system binary located at /bin/kill.
Running {{which kill}} shows only /bin/kill, but {{type -a kill}} lists both:
{code}
user@workernode1:~$ type -a kill
kill is a shell builtin
kill is /bin/kill
{code}
After HADOOP-12317, the kill command becomes:
bq. kill -0 -- -<pid>
The bash built-in kill supports this syntax, but /bin/kill (at least on
Ubuntu 12.04) does not. For some reason, Hadoop Shell picks up /bin/kill
instead of the bash built-in kill.
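One possible way to sidestep the difference, sketched here as an assumption
rather than as the fix adopted for this issue, is to run the probe through
{{bash -c}} so that the built-in is always the one used. The helper name
{{group_alive}} and the sample group id are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether a process group exists by forcing bash's built-in kill.
# group_alive is a hypothetical helper name, not part of Hadoop.
group_alive() {
  # Inside "bash -c", "kill" resolves to the built-in, which accepts "-- -<pgid>".
  # Signal 0 only checks existence and permission; no signal is delivered.
  bash -c 'kill -0 -- "-$1"' _ "$1" 2>/dev/null
}

# 99999999 is above the Linux pid limit, so no such process group can exist.
if group_alive 99999999; then
  echo "process group is alive"
else
  echo "process group is not running"
fi
```

A caller that spawns processes without going through a shell would need to
pass {{bash}}, {{-c}}, and the command string as separate arguments.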
Steps to reproduce this issue (Ubuntu 12.04):
1) Run {{nohup sleep 1000 &}}
2) Run {{ps -ef | grep sleep}}
{code}
hrt_qa@workernode1:~$ ps -ef | grep sleep
hrt_qa 46531 44212 0 23:29 pts/1 00:00:00 sleep 1000
{code}
3) Run {{kill -0 -- -44212}} and then {{/bin/kill -0 -- -44212}}; the two
commands behave differently.
I also tested Ubuntu 14, and it does not have this issue.
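The comparison in step 3 can be scripted roughly as follows. This is a
minimal sketch: {{compare_kill}} is a hypothetical helper, and the exit codes
it reports depend on the procps version installed, so no particular output is
assumed here.

```shell
#!/usr/bin/env bash
# Sketch: run the same "kill -0 -- -<pgid>" probe through the bash built-in
# and through /bin/kill, then print both exit codes side by side.
# compare_kill is a hypothetical helper name.
compare_kill() {
  pgid=$1
  # Built-in kill, forced by running the command inside "bash -c".
  bash -c 'kill -0 -- "-$1"' _ "$pgid" 2>/dev/null && builtin_rc=0 || builtin_rc=$?
  # External binary, invoked by absolute path so the built-in is bypassed.
  /bin/kill -0 -- "-$pgid" 2>/dev/null && external_rc=0 || external_rc=$?
  echo "builtin=$builtin_rc external=$external_rc"
}

# 44212 is the id used in step 3; substitute the id from your own ps output.
compare_kill 44212
```

A non-zero external code next to a zero built-in code for the same live
process group would match the behavior described in this comment.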
> Fix kill command execution under Ubuntu 12
> ------------------------------------------
>
> Key: HADOOP-12441
> URL: https://issues.apache.org/jira/browse/HADOOP-12441
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Wangda Tan
> Priority: Critical
>
> After HADOOP-12317, execution of the kill command fails under Ubuntu 12.
> After the NM restarts, it cannot determine from a container's pid whether
> the process is alive, and it cannot kill the process correctly when the
> RM/AM tells the NM to kill a container.
> Logs from NM (customized logs):
> {code}
> 2015-09-25 21:58:59,348 INFO nodemanager.DefaultContainerExecutor
> (DefaultContainerExecutor.java:containerIsAlive(431)) - ==================
> check alive cmd:[[Ljava.lang.String;@496e442d]
> 2015-09-25 21:58:59,349 INFO nodemanager.NMAuditLogger
> (NMAuditLogger.java:logSuccess(89)) - USER=hrt_qa IP=10.0.1.14
> OPERATION=Stop Container Request TARGET=ContainerManageImpl
> RESULT=SUCCESS APPID=application_1443218269460_0001
> CONTAINERID=container_1443218269460_0001_01_000001
> 2015-09-25 21:58:59,363 INFO nodemanager.DefaultContainerExecutor
> (DefaultContainerExecutor.java:containerIsAlive(438)) -
> ===========================
> ExitCodeException exitCode=1: ERROR: garbage process ID "--".
> Usage:
> kill pid ... Send SIGTERM to every process listed.
> kill signal pid ... Send a signal to every process listed.
> kill -s signal pid ... Send a signal to every process listed.
> kill -l List all signal names.
> kill -L List all signal names in a nice table.
> kill -l signal Convert between signal numbers and names.
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:550)
> at org.apache.hadoop.util.Shell.run(Shell.java:461)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:727)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.containerIsAlive(DefaultContainerExecutor.java:432)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.signalContainer(DefaultContainerExecutor.java:401)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:419)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)
> {code}