Re-sending, this time without the large attachments that blocked its posting. I can provide those straces if folks want.

----- Forwarded message from Brian Reichert <[EMAIL PROTECTED]> -----

Date: Wed, 23 May 2007 15:24:45 -0400
From: Brian Reichert <[EMAIL PROTECTED]>
To: General Linux-HA mailing list <[email protected]>
Subject: Re: [Linux-HA] issue with management of heartbeat.pid file

On Tue, May 22, 2007 at 05:26:04PM -0400, Brian Reichert wrote:
> What I see in make_daemon() is a check for this file, and its contents:
>
>     /* See if heartbeat is already running... */
>
>     if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
>             cl_log(LOG_INFO, "%s: already running [pid %ld]."
>             ,       cmdname, pid);
>             exit(LSB_EXIT_OK);
>     }
>
> But, there's no check to assure the recorded PID is not stale.

Actually, the issue seems to be more complex:

- If I reboot my node repeatedly, about every sixth time, I run into
  the condition that I initially reported. It otherwise behaves.

- I've tried on several different Dell server types, and there seems
  to be some hint that the faster the hardware is, the harder it is
  to reproduce. I'm currently testing on a Dell 1850.

Misc observations:

- Even if I nicely shut down heartbeat from the command line:

    /etc/init.d/heartbeat stop

  I never see the heartbeat.pid file cleaned up, even though there seems
  to be code for it.

- Elsewhere in the code, folks are using kill(pid,0) to test if the
  PID exists, but, as we're running as root at this point, and there's
  no guarantee that the PID will not get reused by some other process
  (esp. after a reboot), a result of 0 doesn't mean much. I'm concerned
  a more rigorous test needs to be put in place.

(Heartbeat seems to kick off a handful of child processes; I'm now
perusing them to see that they're watch-dogged somehow...)

(I don't know what the relationship between these processes is, nor how
they interact. I wonder if it's feasible [or meaningful] to break
them out into separate processes that could be watch-dogged by
daemontools, for example...)

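As a rough illustration of what a "more rigorous test" could look like, the sketch below pairs the kill(pid,0) existence check with an identity check, so that a recycled PID belonging to some unrelated process is not mistaken for heartbeat. This is not code from heartbeat: the helper names (pid_exists, pid_runs_program, pidfile_owner_is_live) are hypothetical, and the identity check assumes a Linux /proc filesystem where /proc/<pid>/exe is a symlink to the running binary. Note that kill() returns 0 when the PID exists, and that EPERM also implies the process is there.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if some process currently has this PID, 0 otherwise.
 * kill(pid, 0) sends no signal; it only performs the existence and
 * permission checks. EPERM means the process exists but we may not
 * signal it, so it still counts as "exists". */
static int pid_exists(pid_t pid)
{
    if (pid <= 0)
        return 0;
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;
}

/* Hypothetical helper, Linux-only (assumes /proc is mounted).
 * Returns 1 if the executable behind `pid` has basename `name`. */
static int pid_runs_program(pid_t pid, const char *name)
{
    char link[64];
    char target[4096];
    const char *base;
    ssize_t n;

    assert(name != NULL);
    snprintf(link, sizeof(link), "/proc/%ld/exe", (long)pid);
    n = readlink(link, target, sizeof(target) - 1);
    if (n < 0)
        return 0;               /* no such process, or not permitted */
    target[n] = '\0';           /* readlink() does not NUL-terminate */

    base = strrchr(target, '/');
    base = (base != NULL) ? base + 1 : target;
    return strcmp(base, name) == 0;
}

/* A recorded PID is treated as live only if it exists AND still
 * belongs to the expected program. */
static int pidfile_owner_is_live(pid_t pid, const char *name)
{
    return pid_exists(pid) && pid_runs_program(pid, name);
}
```

A caller would treat the pidfile as stale whenever pidfile_owner_is_live(pid, "heartbeat") returns 0. There is still a small window between the check and any action taken on it, so this narrows the PID-reuse problem rather than eliminating it.
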
I've attached two 'strace' outputs, both generated like this:

  strace -Ff -v -s 256 /etc/init.d/heartbeat start >& /var/tmp/out

The two traces are: one under 'working' circumstances, and one where my
symptom is exhibited.

These were generated with 2.0.7, but my code research has been with
2.0.8; see below.

This is all under Red Hat 4 Update 4, kernel 2.6.9-42.0.10.ELsmp.

This is the patch to 2.0.8 that I'm testing:

# diff -U3 heartbeat/heartbeat.c.orig heartbeat/heartbeat.c
--- heartbeat/heartbeat.c.orig  2007-01-11 21:57:05.000000000 -0500
+++ heartbeat/heartbeat.c       2007-05-23 12:27:38.000000000 -0400
@@ -4934,7 +4934,7 @@
        /* See if heartbeat is already running... */
 
-       if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid()) {
+       if ((pid=cl_read_pidfile(PIDFILE)) > 0 && pid != getpid() && CL_KILL(pid,0)) {
                cl_log(LOG_INFO, "%s: already running [pid %ld]."
                ,       cmdname, pid);
                exit(LSB_EXIT_OK);

I'll let people know if this clears up what I'm seeing. But I'd
appreciate any feedback from developers. I'm especially curious
about the multiple checks in the code for the validity of
heartbeat.pid's contents...

-- 
Brian Reichert                          <[EMAIL PROTECTED]>
55 Crystal Ave. #286                    Daytime number: (603) 434-6842
Derry NH 03038-1725 USA                 BSD admin/developer at large

----- End forwarded message -----
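One alternative that avoids interpreting kill() results at all is to hold an advisory lock on the pidfile for the life of the daemon. The kernel drops the lock when the holder exits or the machine reboots, so a leftover file with an old PID in it is harmless. The sketch below is an assumption-laden illustration, not heartbeat's implementation: pidfile_acquire is a hypothetical name, and it relies on flock(), which is available on Linux and the BSDs but may behave differently on network filesystems.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to become the sole owner of the pidfile.
 * On success, returns an open descriptor holding an exclusive lock;
 * the caller keeps it open until the daemon exits.
 * Returns -1 if another holder has the lock, or on any error. */
static int pidfile_acquire(const char *path)
{
    char buf[32];
    int fd;
    int len;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* Non-blocking: fail at once if a live process holds the lock. */
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);
        return -1;
    }

    /* We own the file now; replace any stale contents with our PID. */
    len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    if (ftruncate(fd, 0) != 0 ||
        write(fd, buf, (size_t)len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this approach the startup check becomes "could I take the lock?" rather than "does the recorded PID look alive?", which sidesteps PID reuse after a reboot. The PID written into the file is then only informational, for tools such as the init script.
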
