[ 
https://issues.apache.org/jira/browse/HADOOP-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated HADOOP-13837:
---------------------------------
    Attachment: HADOOP-13837.05.patch

> Always get unable to kill error message even the hadoop process was 
> successfully killed
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13837
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: HADOOP-13837.01.patch, HADOOP-13837.02.patch, 
> HADOOP-13837.03.patch, HADOOP-13837.04.patch, HADOOP-13837.05.patch, 
> check_proc.sh
>
>
> *Reproduce steps*
> # Set up a Hadoop cluster
> # Stop resource manager : yarn --daemon stop resourcemanager
> # Stop node manager : yarn --daemon stop nodemanager
> WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill 
> with kill -9
> ERROR: Unable to kill 20325
> The stop command always prints an "Unable to kill <nm_pid>" error message. 
> This gives the user the impression that something is wrong with the node 
> manager process because it could not be forcibly killed. In fact, the kill 
> command works as expected.
> This happens because hadoop-functions.sh does not properly check whether 
> the process still exists after the kill. Currently it checks process 
> liveness immediately after the kill command:
> {code}
> ...
> kill -9 "${pid}" >/dev/null 2>&1
> if ps -p "${pid}" > /dev/null 2>&1; then
>       hadoop_error "ERROR: Unable to kill ${pid}"
> ...
> {code}
> When the resource manager is stopped before the node managers, it always 
> takes some additional time for the node manager process to terminate 
> completely. I printed the output of {{ps -p <nm_pid>}} in a while loop 
> after kill -9 and found the following:
> {noformat}
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 16212 ?        00:00:11 java <defunct>
> 0
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> 1
>   PID TTY          TIME CMD
> ...
> {noformat}
> In the first 3 iterations of the loop, the process had not yet fully 
> terminated (it was still listed as {{<defunct>}}), so the exit code of 
> {{ps -p}} was still {{0}}.
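For readers who want to repeat the observation above, here is a minimal sketch of such a loop. The function name `watch_pid` and its parameters are hypothetical and are not part of hadoop-functions.sh or the attached check_proc.sh; it assumes a procps-style `ps` that accepts `-p`.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only (not from hadoop-functions.sh).
# Prints the output of `ps -p <pid>` followed by its exit code, repeatedly.
#   $1 = pid to watch
#   $2 = number of rounds (default 6)
#   $3 = seconds to sleep between rounds (default 1)
watch_pid() {
  local pid=$1
  local rounds=${2:-6}
  local interval=${3:-1}
  local i rc
  for ((i = 0; i < rounds; i++)); do
    rc=0
    # ps exits 0 while the pid is still listed (including <defunct>),
    # and non-zero once it is no longer in the process table.
    ps -p "${pid}" || rc=$?
    echo "${rc}"
    sleep "${interval}"
  done
}

# Example (commented out): kill a process, then watch it disappear.
# kill -9 "${pid}"; watch_pid "${pid}" 6 1
```

A run that starts right after `kill -9` may show a few rounds with exit code 0 before the code flips to non-zero, which matches the output quoted above.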
> *Proposal of a fix*
> My first idea was to add a more comprehensive pid check that polls pid 
> liveness until HADOOP_STOP_TIMEOUT is reached, but this seems to add too 
> much complexity. The second option is to simply add a {{sleep 3}} after 
> {{kill -9}}. This should fix the error in most cases with relatively small 
> changes to the script.
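
To make the first option concrete, the polling check could look roughly like the sketch below. This is an illustration only and is not taken from any of the attached patches: the function name `wait_for_pid_exit` is hypothetical, and the default timeout of 5 seconds is an assumption based on the warning message quoted in the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not from hadoop-functions.sh or the attached patches.
# Polls once per second until the pid is no longer listed by ps, or until
# the timeout elapses.
#   $1 = pid to wait for
#   $2 = timeout in seconds (default 5; the issue suggests HADOOP_STOP_TIMEOUT)
# Returns 0 if the process is gone, 1 if it is still listed after the timeout.
wait_for_pid_exit() {
  local pid=$1
  local timeout=${2:-5}
  local waited=0
  while ps -p "${pid}" > /dev/null 2>&1; do
    if (( waited >= timeout )); then
      return 1
    fi
    sleep 1
    (( waited += 1 ))
  done
  return 0
}

# Possible usage after the forced kill (commented out; assumes ${pid} is set):
# kill -9 "${pid}" > /dev/null 2>&1
# if ! wait_for_pid_exit "${pid}" "${HADOOP_STOP_TIMEOUT:-5}"; then
#   echo "ERROR: Unable to kill ${pid}" >&2
# fi
```

Compared with a fixed `sleep 3`, this returns as soon as the process disappears and only reports an error if the pid is still listed when the timeout expires, at the cost of a few more lines in the script.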



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

