Re-sending, this time without the large attachments that blocked its
posting.  I can provide those straces if folks want.

----- Forwarded message from Brian Reichert <[EMAIL PROTECTED]> -----

Date: Wed, 23 May 2007 15:24:45 -0400
From: Brian Reichert <[EMAIL PROTECTED]>
To: General Linux-HA mailing list <[email protected]>
Subject: Re: [Linux-HA] issue with management of heartbeat.pid file

On Tue, May 22, 2007 at 05:26:04PM -0400, Brian Reichert wrote:
> What I see in make_daemon() is a check for this file, and its contents:
> 
>         /* See if heartbeat is already running... */
> 
>         if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
>                 cl_log(LOG_INFO, "%s: already running [pid %ld]."
>                 ,       cmdname, pid);
>                 exit(LSB_EXIT_OK);
>         }
> 
> But, there's no check to assure the recorded PID is not stale.

Actually, the issue seems to be more complex:

- If I reboot my node repeatedly, about every sixth time, I run
  into the condition that I initially reported.  It otherwise behaves.

- I've tried on several different Dell server types, and there seems
  to be some hint that the faster the hardware is, the harder it is to
  reproduce.  I'm currently testing on a Dell 1850.

Misc observations:

- Even if I nicely shut down heartbeat from the command line:

   /etc/init.d/heartbeat stop

  I never see the heartbeat.pid file cleaned up, even though there
  seems to be code for it.

- Elsewhere in the code, folks are using kill(pid,0) to test if the
  PID exists.  But as we're running as root at this point (so the
  call can't fail on permissions), and there's no guarantee that the
  PID has not been reused by some other process (esp. after a
  reboot), a result of 0 doesn't mean much: it only says that some
  process has that PID.  I'm concerned a more rigorous test needs to
  be put in place.

  (Heartbeat seems to kick off a handful of child processes; I'm
  now perusing them to see whether they're watch-dogged somehow...)

  (I don't know what the relationship between these processes is, nor
  how they interact.  I wonder if it's feasible [or meaningful] to
  break them out into separate processes that could be watch-dogged
  by daemon tools, for example...)
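
As an illustration of what a "more rigorous test" might look like, here
is a minimal sketch that checks not only that a PID exists but also what
command it is running, by reading /proc/<pid>/stat.  This is
Linux-specific and is not code from heartbeat: read_comm() and
pid_runs_command() are hypothetical helper names, and the assumption that
the daemon's command name is "heartbeat" would need to be verified.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Copy the command name of `pid` (the "(comm)" field of /proc/<pid>/stat)
 * into buf.  The kernel truncates this name to 15 characters.
 * Returns 0 on success, -1 if the process cannot be inspected. */
int read_comm(pid_t pid, char *buf, size_t len)
{
    char path[64], line[512];
    char *open_paren, *close_paren;
    size_t n;
    FILE *fp;

    if (pid <= 0 || len == 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* no such process, or no /proc */
    if (fgets(line, sizeof(line), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* The name sits between the first '(' and the last ')'. */
    open_paren = strchr(line, '(');
    close_paren = strrchr(line, ')');
    if (open_paren == NULL || close_paren == NULL || close_paren < open_paren)
        return -1;
    n = (size_t)(close_paren - open_paren - 1);
    if (n >= len)
        n = len - 1;
    memcpy(buf, open_paren + 1, n);
    buf[n] = '\0';
    return 0;
}

/* 1 if pid is running a command called `name`, 0 if it is running
 * something else, -1 if pid cannot be inspected (treat as stale). */
int pid_runs_command(pid_t pid, const char *name)
{
    char comm[64];

    if (read_comm(pid, comm, sizeof(comm)) != 0)
        return -1;
    return strcmp(comm, name) == 0;
}
```

With something like this, a pidfile entry would only count as "already
running" when pid_runs_command(pid, "heartbeat") returns 1; a reused PID
belonging to some unrelated process would be treated as stale.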

I've attached two 'strace' outputs, both generated like this:

  strace -Ff -v -s 256 /etc/init.d/heartbeat start >& /var/tmp/out

  Of the two traces, one was captured under 'working' circumstances,
  and one where my symptom is exhibited.  These were generated with
  2.0.7, but my code research has been with 2.0.8; see below.  This is
  all under Red Hat 4 Update 4, kernel 2.6.9-42.0.10.ELsmp.

This is the patch to 2.0.8 that I'm testing:

# diff -U3 heartbeat/heartbeat.c.orig heartbeat/heartbeat.c
--- heartbeat/heartbeat.c.orig  2007-01-11 21:57:05.000000000 -0500
+++ heartbeat/heartbeat.c       2007-05-23 12:27:38.000000000 -0400
@@ -4934,7 +4934,7 @@
 
        /* See if heartbeat is already running... */
 
-       if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
+       if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid() && CL_KILL(pid,0) == 0) {
                cl_log(LOG_INFO, "%s: already running [pid %ld]."
                ,       cmdname, pid);
                exit(LSB_EXIT_OK);
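
For reference, kill(pid, 0) sends no signal; it returns 0 if the process
could be signalled and -1 with errno set otherwise.  A minimal sketch of
how those results might be interpreted follows.  The enum and probe_pid()
are hypothetical names used only for illustration, not part of heartbeat.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

enum pid_state { PID_UNKNOWN = -1, PID_GONE = 0, PID_EXISTS = 1 };

/* Probe a PID with signal 0 and classify the outcome. */
enum pid_state probe_pid(pid_t pid)
{
    /* kill() treats 0 and negative values as process groups,
     * so refuse to probe them. */
    if (pid <= 0)
        return PID_UNKNOWN;

    if (kill(pid, 0) == 0)
        return PID_EXISTS;      /* some process has this PID */
    if (errno == ESRCH)
        return PID_GONE;        /* nothing has this PID: stale */
    if (errno == EPERM)
        return PID_EXISTS;      /* exists, but we may not signal it */
    return PID_UNKNOWN;
}
```

Note that PID_EXISTS only says that some process holds the PID, not that
the process is heartbeat.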

I'll let people know if this clears up what I'm seeing.  But I'd
appreciate any feedback from developers.  I'm especially curious
about the multiple checks in the code for the validity of heartbeat.pid's
contents...
-- 
Brian Reichert                          <[EMAIL PROTECTED]>
55 Crystal Ave. #286                    Daytime number: (603) 434-6842
Derry NH 03038-1725 USA                 BSD admin/developer at large    




----- End forwarded message -----

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems