[
https://issues.apache.org/jira/browse/HADOOP-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weiwei Yang updated HADOOP-13837:
---------------------------------
Description:
*Reproduce steps*
# Set up a Hadoop cluster
# Stop the resource manager: {{yarn --daemon stop resourcemanager}}
# Stop the node manager: {{yarn --daemon stop nodemanager}}
WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill
with kill -9
ERROR: Unable to kill 20325
The last step always ends with the "Unable to kill <nm_pid>" error message.
This gives the user the impression that something is wrong with the node
manager process because it could not be forcibly killed. In fact, the kill
command works as expected.
The cause is that hadoop-functions.sh does not check properly whether the
process still exists after the kill. Currently it checks process liveness
immediately after the kill command:
{code}
...
kill -9 "${pid}" >/dev/null 2>&1
if ps -p "${pid}" > /dev/null 2>&1; then
hadoop_error "ERROR: Unable to kill ${pid}"
...
{code}
When the resource manager is stopped before the node managers, it always takes
some additional time until the node manager process completely terminates. I
printed the output of {{ps -p <nm_pid>}} and its exit code in a while loop
after kill -9, and found the following:
{noformat}
16212 ? 00:00:11 java <defunct>
0
PID TTY TIME CMD
16212 ? 00:00:11 java <defunct>
0
PID TTY TIME CMD
16212 ? 00:00:11 java <defunct>
0
PID TTY TIME CMD
1
PID TTY TIME CMD
1
PID TTY TIME CMD
1
PID TTY TIME CMD
...
{noformat}
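A loop of roughly the following shape produces output in the format shown above. This is only a sketch: the function name {{watch_pid}}, the iteration count, and the throwaway {{sleep}} process are assumptions for illustration, and the attached check_proc.sh may differ.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from check_proc.sh): print the output of
# `ps -p <pid>` followed by its exit code, a fixed number of times.
watch_pid() {
  local pid=$1
  local iterations=${2:-6}
  local i=0

  while [ "${i}" -lt "${iterations}" ]; do
    ps -p "${pid}"
    echo $?            # 0 while ps still lists the pid, non-zero once it is gone
    i=$((i + 1))
  done
}

# Demo with a throwaway process standing in for the node manager.
sleep 60 &
demo_pid=$!
kill -9 "${demo_pid}" > /dev/null 2>&1
watch_pid "${demo_pid}" 6
```

How many iterations still report {{0}} depends on how quickly the parent reaps the killed process, so the exact output varies between runs and systems.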
In the first 3 iterations of the loop the process had not yet fully terminated
(it was still listed as {{<defunct>}}), so the exit code of {{ps -p}} was still
{{0}}.
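One way to avoid reporting a failure too early is to poll for a short while before concluding that the kill did not work. The following is a minimal sketch of that idea, not the content of the attached patches: the function name {{wait_for_pid_exit}}, the one-second polling interval, and the five-second default timeout are all assumptions.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of hadoop-functions.sh): poll until the
# process with the given pid is no longer listed by `ps -p`, or until the
# timeout (in seconds) expires.
# Returns 0 if the process disappeared, 1 if it is still listed.
wait_for_pid_exit() {
  local pid=$1
  local timeout=${2:-5}
  local waited=0

  while ps -p "${pid}" > /dev/null 2>&1; do
    if [ "${waited}" -ge "${timeout}" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Demo with a throwaway process standing in for the node manager.
sleep 60 &
demo_pid=$!
kill -9 "${demo_pid}" > /dev/null 2>&1
if wait_for_pid_exit "${demo_pid}" 5; then
  echo "process ${demo_pid} is gone"
else
  echo "ERROR: Unable to kill ${demo_pid}" >&2
fi
```

With a check like this, the error is reported only if the pid is still listed after the timeout, rather than on the first {{ps -p}} call made right after {{kill -9}}.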
*Proposal of a fix*
Firstly I was thinking to add a
> Process check bug in hadoop_stop_daemon of hadoop-functions.sh
> --------------------------------------------------------------
>
> Key: HADOOP-13837
> URL: https://issues.apache.org/jira/browse/HADOOP-13837
> Project: Hadoop Common
> Issue Type: Bug
> Components: scripts
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Attachments: HADOOP-13837.01.patch, HADOOP-13837.02.patch,
> HADOOP-13837.03.patch, HADOOP-13837.04.patch, check_proc.sh
>