Simon Horman wrote:
On Fri, Oct 10, 2008 at 05:55:02PM +0200, Stefan Ott wrote:
Simon Horman wrote:
On Fri, Oct 10, 2008 at 09:40:11AM +1100, Simon Horman wrote:
Hi Stefan,

I think that there is a silly parsing bug. Can you please try:

        checkcommand = /usr/local/sbin/check_lustre_on_realserver

Instead of

        checkcommand = "/usr/local/sbin/check_lustre_on_realserver"
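
For illustration only, here is a minimal sketch of how a parser could accept both forms. It assumes the problem is that the enclosing double quotes end up as part of the command name. The helper unquote_value is hypothetical and is not taken from ldirectord or from the patch mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of ldirectord: normalise a configuration
# value so that "/path/to/cmd" and /path/to/cmd are treated the same.
sub unquote_value {
    my ($value) = @_;
    $value =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    $value =~ s/^"(.*)"$/$1/;    # drop one pair of enclosing double quotes
    return $value;
}

print unquote_value('"/usr/local/sbin/check_lustre_on_realserver"'), "\n";
```

Only a matching pair of quotes around the whole value is removed, so a value with a quote on one side only is left untouched.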

Hi Stefan,

could you try the following patch to see if it solves your
problem without needing to update the configuration file?

Thanks

Hi Simon

I tried both (removing the quotes and applying your patch), neither of which helped. Any other ideas?

Hi Stefan,

sorry for taking a while to look into this. You do need to make
the change above or apply the patch above. But there is also another
change needed.

The signal handling has been set up to auto-reap children,
which is needed for the case where they time out and thus
aren't reaped by the waitpid() that is called inside system().

However, due to the wonders of Perl, setting autoreap actually
changes the return value of waitpid() from > 0 to -1. This
breaks system() and is the root cause of the problem you are seeing.
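
To see the effect in isolation, here is a minimal sketch. It is not code from ldirectord, and the helper name system_with_autoreap is made up for illustration. It assumes Linux and a "true" binary on the PATH. With SIGCHLD set to 'IGNORE' the kernel reaps the child itself, so the waitpid() inside system() has nothing left to collect, and system() should report -1 even though the command itself ran fine.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of ldirectord.
# Runs a command with auto-reaping enabled and returns system()'s
# return value together with the errno text it left behind.
sub system_with_autoreap {
    my @cmd = @_;
    local $SIG{CHLD} = 'IGNORE';    # ask the kernel to auto-reap children
    my $rc  = system(@cmd);         # the internal waitpid() finds no child
    my $err = "$!";                 # copy the errno text right away
    return ($rc, $err);
}

my ($rc, $err) = system_with_autoreap('true');
print "system() returned $rc ($err)\n";
```

A caller that treats any non-zero return from system() as a failed check will then see every check fail, regardless of what the check command actually did.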

My proposed fix is below. It sets up a signal handler that
just reaps the children as necessary. And it does this globally,
as there seems to be no good reason not to.
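
For readers without the patch at hand, the general idea of a reaping handler can be sketched as follows. This is not the patch itself: it follows the usual REAPER idiom from the perlipc documentation, the name reap_children is hypothetical, and it assumes "true" and "false" are on the PATH.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical handler name; a sketch of the idea, not the actual patch.
sub reap_children {
    local ($!, $?);    # do not clobber errno or the child status seen elsewhere
    # Collect every child that has already exited, without blocking.
    1 while waitpid(-1, WNOHANG) > 0;
}

# Install the handler globally rather than setting SIGCHLD to 'IGNORE'.
$SIG{CHLD} = \&reap_children;

# system() still collects its own child, so its return value stays usable.
my $ok  = system('true');
my $bad = system('false');
printf "true: exit %d, false: exit %d\n", $ok >> 8, $bad >> 8;
```

Because SIGCHLD keeps a real handler, the waitpid() inside system() can still collect its own child, while any child that was abandoned after a timeout is picked up by the handler the next time SIGCHLD is delivered.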

An alternative, which is just a work-around, is to simply remove
the "local $SIG{CHLD} = 'IGNORE';" which appears around line 2290.
This, however, will lead to zombies if your check process ever
times out.

Thanks, this patch seems to work!

cheers
--
Stefan Ott
Zentrale Systeme Universität Bern
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
