On Tue, May 22, 2007 at 05:26:04PM -0400, Brian Reichert wrote:
> I've been testing auto_failback in our 2.0.7-based cluster, and
> have found sometimes failback doesn't occur.
>
> We're managing a virtual IP via a haresources file on a Red Hat 4
> box.
>
> What I tracked down was that if the box powered down too quickly
> for heartbeat to clean up, a PID file was left in place:
>
> # ls -ld /usr/local/var/run/heartbeat.pid
> -rw-r----- 1 root root 11 May 22 16:44 /usr/local/var/run/heartbeat.pid
> # cat /usr/local/var/run/heartbeat.pid
> 3215
>
> But, when heartbeat tries to start after a reboot:
>
> May 22 16:46:41 sqe-50 heartbeat: [3214]: WARN: Logging daemon
> is disabled --enabling logging daemon is recommended
> May 22 16:46:41 sqe-50 heartbeat: [3214]: info: **************************
> May 22 16:46:41 sqe-50 heartbeat: [3214]: info: Configuration
> validated. Starting heartbeat 2.0.7
> May 22 16:46:41 sqe-50 heartbeat: [3214]: info: heartbeat: already
> running [pid 3215].
>
> What I see in make_daemon() is a check for this file, and its contents:
>
>     /* See if heartbeat is already running... */
>
>     if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
>         cl_log(LOG_INFO, "%s: already running [pid %ld]."
>         , cmdname, pid);
>         exit(LSB_EXIT_OK);
>     }
>
> But there's no check to ensure the recorded PID is not stale.

I found a different place, where a different sort of check for
stale PIDs is failing for me, again after repeated reboots.

In lib/clplumbing/cl_pidfile.c::DoLock(), we read in the contents
of a [stale] pid file (in my case, it's again heartbeat.pid), and
make various tests. If the stale pid file contains a pid that:

- looks valid
- is not 'my' pid
- but now belongs to some existing process

we all come tumbling down:

    if (sscanf(buf, "%lu", &pid) < 1) {
        /* lockfile screwed up -> rm it and go on */
    } else {
        if (pid > 1 && (getpid() != pid)
            && ((CL_KILL((pid_t)pid, 0) >= 0)
            || errno != ESRCH)) {
            /* tty is locked by existing (not
             * necessarily running) process
             * -> give up */
            close(fd);
            return -1;
        } else {
            /* stale lockfile -> rm it and go on */
        }
    }

We're running as root (as far as I can tell at this point), so
the 'errno != ESRCH' test buys us nothing: '0' is always a valid
signal (so no EINVAL), and we're root (so no EPERM), which leaves
ESRCH as the only error we could get back here.
I'm also confused by the comment; we're not talking about a TTY at
all...
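
To make the point about the kill() probe concrete, here is a minimal
sketch of the same condition as a standalone function. The name
pid_blocks_startup() is made up for illustration and is not in
heartbeat, and I'm assuming CL_KILL() behaves like plain kill().
The thing to notice is that a successful probe only tells us that
*some* process owns the pid, not which one:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not heartbeat code): mirrors the condition in
 * DoLock(), with plain kill() standing in for CL_KILL().
 * Returns 1 if the recorded pid would be treated as a live lock
 * holder, 0 if the pid file would be treated as stale.
 */
static int pid_blocks_startup(long recorded_pid)
{
    if (recorded_pid <= 1 || (pid_t)recorded_pid == getpid())
        return 0;   /* not a usable pid, or it is us */

    if (kill((pid_t)recorded_pid, 0) >= 0)
        return 1;   /* some process has this pid; we don't know which */

    /* kill() failed: ESRCH means no such process; anything else
     * (EPERM when not root) still means the pid exists. */
    return errno != ESRCH;
}
```

After a reboot, whatever happened to be assigned the pid recorded in
the stale file makes this return 1, whether or not it is heartbeat.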

Anyway, as with my patch from yesterday (which still hasn't come out
of the moderator's box yet), the underlying problem is that this is
not a valid statement:

  If heartbeat.pid contains a pid, and that pid exists, it therefore
  is the pid of a running heartbeat process.
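
One way to tighten the check, sketched here only as an illustration
and not as a proposed patch: on Linux, also compare the command name
the kernel reports for that pid. This assumes /proc is mounted and
that the daemon's command name is "heartbeat"; pid_has_name() is a
hypothetical helper, not an existing clplumbing function.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (Linux-only): look up the command name recorded
 * in /proc/<pid>/stat and compare it with `expected`.
 * Returns 1 on a match, 0 on a mismatch, and -1 if the entry cannot
 * be read (no such process, or /proc not available).
 */
static int pid_has_name(pid_t pid, const char *expected)
{
    char path[64];
    char comm[64];
    FILE *f;
    int matched;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* The line starts with: pid (comm) state ...
     * The kernel truncates comm, so long names need care. */
    matched = fscanf(f, "%*d (%63[^)])", comm);
    fclose(f);
    if (matched != 1)
        return -1;

    return strcmp(comm, expected) == 0;
}
```

A stale-pid test could then treat the pid file as live only when the
kill() probe succeeds and pid_has_name(pid, "heartbeat") returns 1.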
--
Brian Reichert <[EMAIL PROTECTED]>
55 Crystal Ave. #286 Daytime number: (603) 434-6842
Derry NH 03038-1725 USA BSD admin/developer at large