[jira] [Commented] (HADOOP-14855) Hadoop scripts may errantly believe a daemon is still running, preventing it from starting

Allen Wittenauer (JIRA) Fri, 08 Sep 2017 14:59:43 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159435#comment-16159435 ]


Allen Wittenauer commented on HADOOP-14855:
-------------------------------------------

(I'm having a total deja vu moment right now.  I wish I could remember who else 
I discussed this issue with a few years ago. haha.)

It reduces the size of the edge case from 0.5% to 0.1% (or whatever). It'll 
still match things like 'cat datanode.txt'.  Speed-wise, though, it's pretty 
expensive when one considers that we've doubled the number of forks for every 
start/status/stop request.  That'll have an impact, especially in places like 
QA.
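
To make the 'cat datanode.txt' case concrete, here is a minimal sketch. It assumes the check under discussion amounts to looking for the daemon name somewhere in the process's command line; the proposal itself is not quoted in this message. The function name and both sample command lines are made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical name-only check: succeeds if the command line mentions
# the daemon name anywhere.
cmdline_mentions_daemon() {
  local cmdline=$1
  local daemonname=$2
  [[ ${cmdline} == *"${daemonname}"* ]]
}

# Made-up command lines, for illustration only.
real_daemon="java -Dproc_datanode org.apache.hadoop.hdfs.server.datanode.DataNode"
impostor="cat datanode.txt"

for cmdline in "${real_daemon}" "${impostor}"; do
  if cmdline_mentions_daemon "${cmdline}" datanode; then
    echo "match:    ${cmdline}"
  else
    echo "no match: ${cmdline}"
  fi
done
```

Both command lines are reported as matches, which is the residual false positive: the name alone cannot tell the daemon apart from an unrelated process that happens to mention it.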

But giving some further thought to it... I think you're on to something that 
might work pretty well... hmm...

off the top:
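{code}
pspid=$(ps -fp "${pid}" 2>/dev/null)

# ps exits 0 only if the pid exists; then check it is really our daemon
if [[ $? -eq 0 ]]; then
  if [[ ${pspid} =~ Dproc_${daemonname} ]]; then
    : # daemon is genuinely running
  fi
fi
{code}

or whatever.  [e.g., that $? construction has issues: it only works while 
nothing else runs between the ps call and the test.]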

I think that'd be nearly the same cost as we have now and doesn't make the 
edge-case situation more expensive.  It also avoids the IO of writing the ps 
output to a temp file, which is very tempting. The 'grep' is replaced by an 
internal regex check, and since 3.x consistently defines proc_ for jps usage, 
we can bounce off of that to reduce the search space even more.
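
As a rough sketch of how that idea could be packaged, the snippet below splits the regex test from the ps call so the matching can be exercised without a live daemon. The function names ps_output_matches_daemon and daemon_is_running are hypothetical and are not taken from the Hadoop scripts; the only assumption about Hadoop is the one stated above, that the daemon's command line carries a -Dproc_<name> marker.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if the ps output contains the
# Dproc_<daemonname> marker. Uses the shell's built-in regex match,
# so no extra grep process is forked.
ps_output_matches_daemon() {
  local psout=$1
  local daemonname=$2
  [[ ${psout} =~ Dproc_${daemonname} ]]
}

# Hypothetical helper: succeeds only if the pid is live AND its command
# line looks like the named daemon.
daemon_is_running() {
  local pid=$1
  local daemonname=$2
  local psout

  # Declaring psout separately matters: 'local psout=$(...)' would
  # return the status of 'local', hiding the exit status of ps.
  # Testing the assignment directly also avoids a separate $? check.
  if psout=$(ps -fp "${pid}" 2>/dev/null); then
    ps_output_matches_daemon "${psout}" "${daemonname}"
  else
    return 1
  fi
}
```

A caller might write 'if daemon_is_running "${pid}" namenode; then ...'. This costs one ps fork per check, the same as a pid-only ps test, and it still is not foolproof: any process whose command line happens to contain the marker string would match.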

It's still not foolproof, but it does cut down the chances of false positives.  
It's just a matter of whether it's worth it or not.

BTW, there are some other patches out there regarding this code but I haven't 
had a chance to really play with the edge cases. (and there are a lot.)

> Hadoop scripts may errantly believe a daemon is still running, preventing it 
> from starting
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-14855
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14855
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Aaron T. Myers
>
> I encountered a case recently where the NN wouldn't start, with the error 
> message "namenode is running as process 16769.  Stop it first." In fact the 
> NN was not running at all, but rather another long-running process was 
> running with this pid.
> It looks to me like our scripts just check to see if _any_ process is running 
> with the pid that the NN (or any Hadoop daemon) most recently ran with. This 
> is clearly not a fool-proof way of checking to see if a particular type of 
> daemon is now running, as some other process could start running with the 
> same pid since the daemon in question was previously shut down.
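
The failure described in the report can be reproduced in miniature. The sketch below is not the Hadoop script code; pid_only_is_running is a hypothetical stand-in for any check that only asks whether the recorded pid is alive. It writes the current shell's pid into a temporary pid file to simulate a stale file whose pid has since been reused by an unrelated process.

```shell
#!/usr/bin/env bash

# Hypothetical pid-only check: reports "running" if ANY process owns
# the pid recorded in the pid file.
pid_only_is_running() {
  local pidfile=$1
  local pid
  [[ -f "${pidfile}" ]] || return 1
  pid=$(<"${pidfile}")
  # kill -0 sends no signal; it only tests that the pid exists and
  # can be signalled by this user.
  kill -0 "${pid}" 2>/dev/null
}

# Simulate a stale pid file: record this shell's own pid as if it had
# once belonged to a namenode.
stale_pidfile=$(mktemp)
echo "$$" > "${stale_pidfile}"

if pid_only_is_running "${stale_pidfile}"; then
  result="namenode is running as process $(<"${stale_pidfile}").  Stop it first."
else
  result="not running"
fi
echo "${result}"
rm -f "${stale_pidfile}"
```

The check reports the namenode as running even though the pid belongs to the shell itself, which is the same class of false positive as in the report.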



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)