On Tue, May 22, 2007 at 05:26:04PM -0400, Brian Reichert wrote:
> I've been testing auto_failback in our 2.0.7-based cluster, and
> have found sometimes failback doesn't occur.
> 
> We're managing a virtual IP via a haresources file on a Red Hat 4
> box.
> 
> What I tracked down was that if the box powered down too quickly
> for heartbeat to clean up, a PID file was left in place:
> 
>   # ls -ld /usr/local/var/run/heartbeat.pid
>   -rw-r-----  1 root root 11 May 22 16:44 /usr/local/var/run/heartbeat.pid
>   # cat /usr/local/var/run/heartbeat.pid
>       3215
> 
> But, when heartbeat tries to start after a reboot:
> 
>   May 22 16:46:41 sqe-50 heartbeat: [3214]: WARN: Logging daemon
>   is disabled --enabling logging daemon is recommended
>   May 22 16:46:41 sqe-50 heartbeat: [3214]: info: **************************
>   May 22 16:46:41 sqe-50 heartbeat: [3214]: info: Configuration
>   validated.  Starting heartbeat 2.0.7
>   May 22 16:46:41 sqe-50 heartbeat: [3214]: info: heartbeat: already
>   running [pid 3215].
> 
> What I see in make_daemon() is a check for this file, and its contents:
> 
>         /* See if heartbeat is already running... */
> 
>         if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
>                 cl_log(LOG_INFO, "%s: already running [pid %ld]."
>                 ,       cmdname, pid);
>                 exit(LSB_EXIT_OK);
>         }
> 
> But, there's no check to ensure the recorded PID is not stale.

I found a different place, where a different sort of check for
stale PIDs is failing for me, again after repeated reboots.

In lib/clplumbing/cl_pidfile.c::DoLock() 

We read in the contents of a [stale] pid file (in my case, it's
again heartbeat.pid), and make various tests.  If the stale pid
file holds a pid that:
- looks valid
- is not 'my' pid
- but now belongs to some live process
then it all comes tumbling down: we give up and return -1.

   if (sscanf(buf, "%lu", &pid) < 1) {
           /* lockfile screwed up -> rm it and go on */
   } else {
           if (pid > 1 && (getpid() != pid)
           &&      ((CL_KILL((pid_t)pid, 0) >= 0)
           ||      errno != ESRCH)) {
                   /* tty is locked by existing (not
                    * necessarily running) process
                    * -> give up */
                   close(fd);
                   return -1;
           } else {
                   /* stale lockfile -> rm it and go on */
           }
    }

We're running as root (as far as I can tell at this point), so the
extra 'errno != ESRCH' test buys us nothing: '0' is always a valid
signal (so no EINVAL), and we're root, so no EPERM.  ESRCH is the
only way the kill() can fail here.
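
To spell out the cases I mean, here is a minimal sketch of what
kill(pid, 0) can report.  probe_pid() and the enum are hypothetical
names of my own, not anything in the heartbeat tree, and I'm using
plain kill() rather than the CL_KILL() wrapper:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of heartbeat: classify what
 * kill(pid, 0) tells us about a pid. */
enum pid_probe {
	PID_GONE,		/* ESRCH: no such process, pid file is stale */
	PID_ALIVE,		/* signal 0 could have been delivered */
	PID_ALIVE_NOPERM	/* EPERM: exists, but we may not signal it */
};

static enum pid_probe
probe_pid(pid_t pid)
{
	if (kill(pid, 0) == 0)
		return PID_ALIVE;
	if (errno == ESRCH)
		return PID_GONE;
	/* Signal 0 is always valid, so EINVAL cannot happen; the only
	 * remaining documented failure is EPERM. */
	return PID_ALIVE_NOPERM;
}
```

As root we should only ever see PID_ALIVE or PID_GONE.  Note that
PID_ALIVE says nothing about *which* program owns the pid.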

I'm also confused by the comment; we're not talking about a TTY at
all...

Anyway, the point is the same as in my patch from yesterday (which
still hasn't come out of the moderator's box): this is not a valid
statement:

  If heartbeat.pid contains a pid, and that pid exists, it therefore
  is the pid of a running heartbeat process.
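
One way to tighten that up, as a rough sketch only: on Linux, look
at what the pid is actually running before believing the pid file.
This assumes /proc is mounted and that comparing the basename of
/proc/<pid>/exe is good enough; pid_runs_program() is a hypothetical
name, not an existing heartbeat or clplumbing function:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if the executable behind 'pid' has
 * the basename 'name', 0 otherwise (including "cannot tell"). */
static int
pid_runs_program(pid_t pid, const char *name)
{
	char link[64];
	char target[4096];
	const char *base;
	ssize_t len;

	snprintf(link, sizeof(link), "/proc/%ld/exe", (long)pid);
	len = readlink(link, target, sizeof(target) - 1);
	if (len < 0)
		return 0;	/* no such process, or not allowed to look */
	target[len] = '\0';	/* readlink() does not terminate the string */

	base = strrchr(target, '/');
	base = (base != NULL) ? base + 1 : target;
	return strcmp(base, name) == 0;
}
```

The pid file would then count as live only if the kill(pid, 0) test
passes *and* pid_runs_program(pid, "heartbeat") is true.  This is
not airtight: the pid can still be recycled between the check and
whatever we do next, and a binary replaced on disk shows up with a
" (deleted)" suffix, which this sketch treats as a mismatch.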

-- 
Brian Reichert                          <[EMAIL PROTECTED]>
55 Crystal Ave. #286                    Daytime number: (603) 434-6842
Derry NH 03038-1725 USA                 BSD admin/developer at large    
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems