I've been testing auto_failback in our 2.0.7-based cluster, and
have found that failback sometimes doesn't occur.

We're managing a virtual IP via a haresources file on a Red Hat 4
box.

What I tracked down was that if the box powered down too quickly
for heartbeat to clean up, a PID file was left in place:

  # ls -ld /usr/local/var/run/heartbeat.pid
  -rw-r-----  1 root root 11 May 22 16:44 /usr/local/var/run/heartbeat.pid
  # cat /usr/local/var/run/heartbeat.pid
      3215

But, when heartbeat tries to start after a reboot:

  May 22 16:46:41 sqe-50 heartbeat: [3214]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
  May 22 16:46:41 sqe-50 heartbeat: [3214]: info: **************************
  May 22 16:46:41 sqe-50 heartbeat: [3214]: info: Configuration validated.  Starting heartbeat 2.0.7
  May 22 16:46:41 sqe-50 heartbeat: [3214]: info: heartbeat: already running [pid 3215].

What I see in make_daemon() is a check for this file and its contents:

        /* See if heartbeat is already running... */

        if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
                cl_log(LOG_INFO, "%s: already running [pid %ld]."
                ,       cmdname, pid);
                exit(LSB_EXIT_OK);
        }

But there's no check to ensure the recorded PID is not stale.
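
To make concrete the kind of staleness check I have in mind, here is a
rough sketch. It is not heartbeat code: pid_exists() and
pid_runs_program() are hypothetical helper names of my own, and the
second one assumes a Linux-style /proc. Testing only whether the PID
exists is probably not enough after a reboot, since a recycled PID can
easily belong to some unrelated process (note the new heartbeat above
is 3214 and the stale file says 3215), so the sketch also compares the
executable name.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of heartbeat): return 1 if some process
 * with this PID currently exists, 0 otherwise.
 */
static int pid_exists(pid_t pid)
{
        if (pid <= 0) {
                return 0;
        }
        /* Signal 0 sends nothing; only existence/permission is checked. */
        if (kill(pid, 0) == 0) {
                return 1;
        }
        /* EPERM means the process exists but we may not signal it. */
        return errno == EPERM;
}

/*
 * Hypothetical helper, Linux-only assumption: return 1 if the PID exists
 * and /proc/<pid>/exe points at a binary whose basename is `name`.
 * Anything we cannot verify is reported as 0, i.e. treated as stale.
 * Caveat: if the binary was replaced on disk, the link target carries a
 * " (deleted)" suffix and this simple comparison will not match.
 */
static int pid_runs_program(pid_t pid, const char *name)
{
        char link[64];
        char target[4096];
        const char *base;
        ssize_t n;

        if (!pid_exists(pid)) {
                return 0;
        }
        snprintf(link, sizeof(link), "/proc/%ld/exe", (long)pid);
        n = readlink(link, target, sizeof(target) - 1);
        if (n < 0) {
                return 0;
        }
        target[n] = '\0';       /* readlink() does not NUL-terminate */
        base = strrchr(target, '/');
        base = (base != NULL) ? base + 1 : target;
        return strcmp(base, name) == 0;
}
```

With something like this, the "already running" branch would be taken
only when pid_runs_program(pid, "heartbeat") is true; otherwise the PID
file could be treated as stale and overwritten. Whether that belongs in
make_daemon() or inside cl_read_pidfile() is a design question I
haven't looked into.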

Have others seen this?  This code seems to be in 2.0.8 as well...

-- 
Brian Reichert                          <[EMAIL PROTECTED]>
55 Crystal Ave. #286                    Daytime number: (603) 434-6842
Derry NH 03038-1725 USA                 BSD admin/developer at large    
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
