[ https://issues.apache.org/jira/browse/HADOOP-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159228#comment-16159228 ]

Allen Wittenauer commented on HADOOP-14855:
-------------------------------------------

This is a dupe of HADOOP-9085 (and its buddy HADOOP-9086).

[[email protected]]'s comments are spot on, with this being the key one:

bq. What we need to do is move away from pid-file-liveness tests altogether.

Unfortunately, we're using Java. Doing liveness checks anywhere but in bash 
is either extremely expensive (because of the massive classpath) or 
non-portable and introduces more environmental dependencies.

Other thoughts:

1) PID clashes like this one are rare edge cases. They generally aren't 
worth spending the effort on.

2) Given user functions and shell profiles, end users (or vendors) can 
replace the pid checking/handling on their own. I expect experienced admins 
to replace it with daemontools and the like.
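To illustrate the supervisor approach mentioned above, here is a minimal Python sketch of a daemontools-style parent that keeps a daemon running without consulting a pid file. It is not daemontools and not part of Hadoop's scripts; the function name `run_supervised` and its `max_restarts`/`backoff_s` parameters are hypothetical. What it shows: because the supervisor is the daemon's parent, `Popen.wait()` reports liveness directly, and the kernel does not reuse a child's pid until its parent has reaped it, so an unrelated process cannot be mistaken for the daemon.

```python
import subprocess
import time
from typing import List


def run_supervised(argv: List[str], max_restarts: int = 3,
                   backoff_s: float = 1.0) -> List[int]:
    """Hypothetical supervisor: run argv as a child, restart it on failure.

    Liveness comes from the parent/child relationship (Popen.wait), not
    from a pid file, so pid reuse by another process cannot confuse it.
    Returns the exit code of every run, in order.
    """
    exit_codes: List[int] = []
    while True:
        child = subprocess.Popen(argv)
        # Blocks until the child exits, then reaps it (frees its pid).
        exit_codes.append(child.wait())
        # Exit code 0 is treated as an intentional shutdown; otherwise
        # restart until the restart budget is used up.
        if exit_codes[-1] == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        time.sleep(backoff_s)
```

For example, `run_supervised([sys.executable, "-c", "raise SystemExit(1)"], max_restarts=2, backoff_s=0)` starts the child three times (the initial run plus two restarts) and returns `[1, 1, 1]`. Real supervisors add logging, signal forwarding and rate limiting; this sketch only demonstrates why a parent process needs no pid file.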

> Hadoop scripts may errantly believe a daemon is still running, preventing it from starting
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-14855
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14855
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: scripts
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Aaron T. Myers
>
> I encountered a case recently where the NN wouldn't start, with the error 
> message "namenode is running as process 16769.  Stop it first." In fact the 
> NN was not running at all, but rather another long-running process was 
> running with this pid.
> It looks to me like our scripts just check to see if _any_ process is running 
> with the pid that the NN (or any Hadoop daemon) most recently ran with. This 
> is clearly not a fool-proof way of checking to see if a particular type of 
> daemon is now running, as some other process could start running with the 
> same pid since the daemon in question was previously shut down.
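The check described in the report can be sketched in Python as follows, assuming Linux and its `/proc` filesystem. `pid_is_alive` mirrors the naive test (does *any* process own this pid?), while the hypothetical `pid_runs_command` additionally requires one of the process's argv entries to contain an expected string, such as the daemon's main class. This is an illustration, not the scripts' actual logic: it narrows the false positive described above but still relies on a pid file, which the comment argues should be abandoned altogether.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Naive liveness test: True if *any* process currently owns `pid`."""
    try:
        # Signal 0 performs existence/permission checks without sending anything.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The pid exists but belongs to another user.
        return True
    return True


def pid_runs_command(pid: int, expected: str) -> bool:
    """Stricter test (Linux-only assumption): the pid must be alive AND one of
    its argv entries, read from /proc/<pid>/cmdline, must contain `expected`."""
    if not pid_is_alive(pid):
        return False
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            # argv entries are NUL-separated in /proc/<pid>/cmdline.
            argv = f.read().split(b"\0")
    except OSError:
        # Process exited between the two checks, or /proc is unavailable.
        return False
    return any(expected.encode() in arg for arg in argv)
```

For instance, `pid_runs_command(16769, "org.apache.hadoop.hdfs.server.namenode.NameNode")` would return `False` for the unrelated long-running process in the report, assuming the NameNode's main class appears in its argv. A race remains: the pid could be recycled by another instance of the same command, which is why the check is only a mitigation.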



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
