[
https://issues.apache.org/jira/browse/HADOOP-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794334#comment-15794334
]
Weiwei Yang commented on HADOOP-13837:
--------------------------------------
Hi [~aw]
I have updated the title and the description of this JIRA; hopefully they now
describe the issue and the fix better. Please help review the v4 patch.
Thank you.
> Always get unable to kill error message even the hadoop process was
> successfully killed
> ---------------------------------------------------------------------------------------
>
> Key: HADOOP-13837
> URL: https://issues.apache.org/jira/browse/HADOOP-13837
> Project: Hadoop Common
> Issue Type: Bug
> Components: scripts
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Attachments: HADOOP-13837.01.patch, HADOOP-13837.02.patch,
> HADOOP-13837.03.patch, HADOOP-13837.04.patch, check_proc.sh
>
>
> *Reproduce steps*
> # Set up a Hadoop cluster
> # Stop the resource manager: yarn --daemon stop resourcemanager
> # Stop the node manager: yarn --daemon stop nodemanager
> WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill
> with kill -9
> ERROR: Unable to kill 20325
> It always prints the "Unable to kill <nm_pid>" error message, which gives
> the user the impression that something is wrong with the node manager
> process because it could not be forcibly killed. In fact, the kill command
> works as expected.
> This happens because hadoop-functions.sh does not properly check whether the
> process still exists after the kill. Currently it checks process liveness
> immediately after the kill command:
> {code}
> ...
> kill -9 "${pid}" >/dev/null 2>&1
> if ps -p "${pid}" > /dev/null 2>&1; then
>   hadoop_error "ERROR: Unable to kill ${pid}"
> ...
> {code}
> When the resource manager is stopped before the node managers, the node
> manager process always takes some additional time to terminate completely.
> I printed the output of {{ps -p <nm_pid>}}, followed by its exit code, in a
> while loop after {{kill -9}} and found the following:
> {noformat}
> 16212 ? 00:00:11 java <defunct>
> 0
> PID TTY TIME CMD
> 16212 ? 00:00:11 java <defunct>
> 0
> PID TTY TIME CMD
> 16212 ? 00:00:11 java <defunct>
> 0
> PID TTY TIME CMD
> 1
> PID TTY TIME CMD
> 1
> PID TTY TIME CMD
> 1
> PID TTY TIME CMD
> ...
> {noformat}
> In the first 3 iterations of the loop the process had not yet fully
> terminated (it was still listed as {{<defunct>}}), so the exit code of
> {{ps -p}} was still {{0}}.
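The observation loop described above could be sketched roughly as follows. This is only an illustration, not the script that was actually run: the function name observe_pid and its count and interval parameters are hypothetical.

```shell
# Hypothetical helper (an assumption, not part of hadoop-functions.sh):
# print the output of `ps -p <pid>` followed by its exit status,
# `count` times, sleeping `interval` seconds between checks.
observe_pid() {
  local pid=$1
  local count=${2:-6}
  local interval=${3:-1}
  local i rc

  for (( i = 0; i < count; i++ )); do
    rc=0
    # ps exits 0 while the pid is still in the process table
    # (including a <defunct> entry) and non-zero once it is gone.
    ps -p "${pid}" || rc=$?
    echo "${rc}"
    sleep "${interval}"
  done
}
```

Calling observe_pid "${pid}" right after the kill -9 should show the exit status changing from 0 to non-zero once the process has disappeared from the process table.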
> *Proposal of a fix*
> My first idea was a more comprehensive pid check that polls the pid's
> liveness until HADOOP_STOP_TIMEOUT is reached, but this seems to add too
> much complexity. The second option is to simply add a {{sleep 3}} after
> {{kill -9}}; this should fix the error in most cases with a relatively small
> change to the script.
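For comparison, the first option (polling the pid until a timeout) might look roughly like the sketch below. It is not taken from any of the attached patches; the function name wait_for_pid_exit and its timeout argument are hypothetical, and the caller could pass HADOOP_STOP_TIMEOUT or a smaller value.

```shell
# Hypothetical helper (an assumption, not part of hadoop-functions.sh):
# wait up to `timeout` seconds for `pid` to leave the process table.
# Returns 0 once `ps -p` no longer finds the pid, 1 on timeout.
wait_for_pid_exit() {
  local pid=$1
  local timeout=${2:-5}
  local waited=0

  while ps -p "${pid}" > /dev/null 2>&1; do
    if (( waited >= timeout )); then
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Possible use after the forced kill (echo stands in for hadoop_error):
#   kill -9 "${pid}" > /dev/null 2>&1
#   if ! wait_for_pid_exit "${pid}" 3; then
#     echo "ERROR: Unable to kill ${pid}" >&2
#   fi
```

Compared with a fixed {{sleep 3}}, this returns as soon as the process is gone and only reports an error if it is still present after the timeout, at the cost of a few more lines in the script.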