[ 
https://issues.apache.org/jira/browse/HADOOP-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159295#comment-16159295
 ] 

Aaron T. Myers commented on HADOOP-14855:
-----------------------------------------

Actually, [~aw], here's a lightweight suggestion to make this check at least 
much more robust, if not quite foolproof. The current code that does this just 
checks to see if a process is running with the pid in question. But we also 
know the name of the daemon we're checking on, so couldn't we pretty easily 
make this check more robust by also grepping for the name of the daemon in the 
{{ps}} output for the pid in question? That would take an already rare issue 
and make it _exceptionally_ unlikely to result in a false positive, and without 
adding any additional dependencies beyond grep. 

Specifically, I'm thinking of replacing this line:

{code}
if ps -p "${pid}" > /dev/null 2>&1; then
{code}

With something like this:

{code}
if ps -fp "${pid}" | grep "${daemonname}" > /dev/null 2>&1; then
{code}

Total shell scripting newbie here, so please feel free to tell me that this is 
way off base.
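
For illustration, the same idea can be sketched as a small helper function. This is only a rough sketch, not the actual Hadoop script code: the function name {{daemon_pid_matches}} is hypothetical, and it assumes the daemon name appears somewhere in the process's command line as reported by {{ps}}. It asks {{ps}} for only the command line of the pid and does a quiet fixed-string match on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Hadoop scripts): succeeds only if a
# process with the given pid exists AND its command line contains the
# daemon name. Assumes the daemon name shows up in the command line.
daemon_pid_matches() {
  local pid="$1"
  local daemonname="$2"
  local cmdline

  # An empty name would match every process, so treat it as "no match".
  [ -n "${daemonname}" ] || return 1

  # "-o args=" prints only the command line, with no header row.
  # ps exits non-zero when no process has this pid.
  cmdline=$(ps -p "${pid}" -o args= 2>/dev/null) || return 1

  # -q: no output, -F: fixed string rather than a regex,
  # --: end of options, in case the name starts with a dash.
  printf '%s\n' "${cmdline}" | grep -qF -- "${daemonname}"
}
```

The existing test could then become something like {{if daemon_pid_matches "${pid}" "${daemonname}"; then}}. Note that this is a substring match, so a short name can still match a longer one that contains it, and some platforms truncate the command line that {{ps}} reports. It narrows the false positive rather than eliminating it.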

> Hadoop scripts may errantly believe a daemon is still running, preventing it 
> from starting
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-14855
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14855
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Aaron T. Myers
>
> I encountered a case recently where the NN wouldn't start, with the error 
> message "namenode is running as process 16769.  Stop it first." In fact the 
> NN was not running at all, but rather another long-running process was 
> running with this pid.
> It looks to me like our scripts just check to see if _any_ process is running 
> with the pid that the NN (or any Hadoop daemon) most recently ran with. This 
> is clearly not a fool-proof way of checking to see if a particular type of 
> daemon is now running, as some other process could start running with the 
> same pid since the daemon in question was previously shut down.



