[ https://issues.apache.org/jira/browse/HADOOP-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weiwei Yang updated HADOOP-13837:
---------------------------------
    Priority: Critical  (was: Major)

> Always get unable to kill error message even the hadoop process was successfully killed
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13837
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: HADOOP-13837.01.patch, HADOOP-13837.02.patch, HADOOP-13837.03.patch, HADOOP-13837.04.patch, check_proc.sh
>
>
> *Reproduce steps*
> # Set up a Hadoop cluster
> # Stop the resource manager: yarn --daemon stop resourcemanager
> # Stop the node manager: yarn --daemon stop nodemanager
> WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
> ERROR: Unable to kill 20325
> The script always prints the "Unable to kill <nm_pid>" error message. This gives the
> user the impression that something is wrong with the node manager process because it
> could not be forcibly killed. In fact, the kill command works as expected.
> This happens because hadoop-functions.sh does not properly check whether the process
> still exists after the kill. Currently it checks process liveness immediately after
> the kill command:
> {code}
> ...
> kill -9 "${pid}" >/dev/null 2>&1
> if ps -p "${pid}" > /dev/null 2>&1; then
>   hadoop_error "ERROR: Unable to kill ${pid}"
> ...
> {code}
> When the resource manager is stopped before the node managers, the node manager
> process always takes some additional time to terminate completely. I printed the
> output of {{ps -p <nm_pid>}} and its exit code in a while loop after kill -9, and
> found the following:
> {noformat}
> 16212 ? 00:00:11 java <defunct>
> 0
> PID TTY TIME CMD
> 16212 ? 00:00:11 java <defunct>
> 0
> PID TTY TIME CMD
> 16212 ? 00:00:11 java <defunct>
> 0
> PID TTY TIME CMD
> 1
> PID TTY TIME CMD
> 1
> PID TTY TIME CMD
> 1
> PID TTY TIME CMD
> ...
> {noformat}
> In the first three iterations of the loop the process had not yet terminated, so
> the exit code of {{ps -p}} was still {{0}}.
> *Proposal of a fix*
> At first I considered adding a more comprehensive pid check that polls the pid's
> liveness until HADOOP_STOP_TIMEOUT is reached, but this seems to add too much
> complexity. The second option is to simply add a {{sleep 3}} after {{kill -9}};
> it should fix the error in most cases with relatively small changes to the script.
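For illustration only, the first option in the proposal (poll the pid's liveness until a timeout is reached) could be sketched roughly as below. This is not taken from any of the attached patches. The function name `wait_for_pid_exit` and its arguments are hypothetical, and the probe uses the shell builtin `kill -0` instead of the `ps -p` call quoted above, which is an assumption made to keep the sketch self-contained.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of hadoop-functions.sh.
# Polls once per second until the process with the given pid no longer
# exists, or until the timeout (in seconds, default 5) has elapsed.
# Returns 0 if the process is gone, 1 if it is still present at timeout.
wait_for_pid_exit() {
  local pid=$1
  local timeout=${2:-5}
  local waited=0

  # kill -0 sends no signal; it only reports whether the pid can be signalled.
  while kill -0 "${pid}" >/dev/null 2>&1; do
    if [ "${waited}" -ge "${timeout}" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible use after the forced kill (shown as a comment, not executed here):
#   kill -9 "${pid}" >/dev/null 2>&1
#   if ! wait_for_pid_exit "${pid}" "${HADOOP_STOP_TIMEOUT:-5}"; then
#     echo "ERROR: Unable to kill ${pid}" 1>&2
#   fi
```

Compared with a fixed {{sleep 3}}, a loop like this returns as soon as the process disappears and only reports an error if the process is still there when the timeout expires, at the cost of a few more lines in the script.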