On Fri, Oct 10, 2008 at 05:55:02PM +0200, Stefan Ott wrote:
> Simon Horman wrote:
>> On Fri, Oct 10, 2008 at 09:40:11AM +1100, Simon Horman wrote:
>>> Hi Stefan,
>>>
>>> I think that there is a silly parsing bug. Can you please try:
>>>
>>> checkcommand = /usr/local/sbin/check_lustre_on_realserver
>>>
>>> Instead of
>>>
>>> checkcommand = "/usr/local/sbin/check_lustre_on_realserver"
>>
>> Hi Stefan,
>>
>> could you try the following patch to see if it solves your
>> problem without needing to update the configuration file?
>>
>> Thanks
>>
>
> Hi Simon
>
> I tried both (removing the quotes and applying your patch), none of
> which helped. Any other ideas?
Hi Stefan,
sorry for taking a while to look into this. You do need to make
the change above or apply the patch above. But there is also another
change needed.
The signal handling has been set up to auto-reap children
($SIG{CHLD} = 'IGNORE'), which is needed for the case where they
time out and thus aren't reaped by the waitpid() that is called
inside system(). However, due to the wonders of Perl, setting
autoreap actually changes the return value of waitpid() from > 0
to -1. This breaks system() and is the root cause of the problem
you are seeing.
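A small standalone script can illustrate the effect. This is only a
sketch, not code from ldirectord: the helper names run_default and
run_ignored are made up for the example, and it assumes a Unix-like
system with "true" in the PATH. On most Unix platforms the second call
is expected to report -1 with $! set to "No child processes", but the
exact behaviour is platform dependent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a trivial command with the default
# SIGCHLD disposition, so system() can wait for its own child.
sub run_default {
    local $SIG{CHLD} = 'DEFAULT';
    return system('true');
}

# Hypothetical helper: run the same command with auto-reaping on.
# The child is reaped automatically, so the waitpid() inside
# system() has nothing left to collect.
sub run_ignored {
    local $SIG{CHLD} = 'IGNORE';
    return system('true');
}

my $default = run_default();
my $ignored = run_ignored();

print "default: $default\n";
print "ignored: $ignored ($!)\n";
```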
My proposed fix is below. It sets up a signal handler that
just reaps the children as necessary. And it does this globally,
as there seems to be no good reason not to.
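The general pattern, stripped of everything specific to ldirectord,
might look roughly like the sketch below. The names $got_chld and
reap_children are illustrative only and do not appear in the patch.
The handler just sets a flag, and the reaping is done later from
ordinary code using a non-blocking waitpid().

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(WNOHANG _exit);

my $got_chld;    # set by the handler, cleared by reap_children()

# Keep the handler trivial: only record that some child has exited.
$SIG{CHLD} = sub { $got_chld = 1 };

# Collect every child that has already exited, without blocking.
# Returns the number of children reaped.
sub reap_children {
    my $count = 0;
    undef $got_chld;
    while (waitpid(-1, WNOHANG) > 0) {
        $count++;
    }
    return $count;
}

# system() waits for its own child, so it still sees the real status.
my $status = system('sh', '-c', 'exit 3');
print "exit code: ", $status >> 8, "\n";

# Nothing is left to reap here; this call just clears the flag.
my $leftover = reap_children();

# A child that nobody waits for, as happens when a check times out.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
_exit(0) if $pid == 0;

# Poll until the handler has seen the child exit, then reap it.
select(undef, undef, undef, 0.05) until $got_chld;
my $reaped = reap_children();
print "reaped: $reaped\n";
```

Because SIGCHLD is handled rather than ignored, the waitpid() inside
system() can still collect its child, and anything left over is
picked up on the next pass through reap_children().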
An alternative, which is just a work-around, is to simply remove
the "local $SIG{CHLD} = 'IGNORE';" which appears around line 2290.
This, however, will lead to zombies if your check process ever
times out.
--
Simon Horman
VA Linux Systems Japan K.K., Sydney, Australia Satellite Office
H: www.vergenet.net/~horms/ W: www.valinux.co.jp/en
Index: lha-dev/ldirectord/ldirectord.in
===================================================================
--- lha-dev.orig/ldirectord/ldirectord.in 2008-10-15 17:43:19.000000000 +1100
+++ lha-dev/ldirectord/ldirectord.in 2008-10-15 17:55:16.000000000 +1100
@@ -645,6 +645,7 @@ use vars qw(
$DAEMON_STATUS_ALL
$DAEMON_TERM
$DAEMON_HUP
+ $DAEMON_CHLD
$opt_d
$opt_h
$stattime
@@ -800,6 +801,9 @@ sub ld_init
# HUP is actually used
$SIG{'HUP'} = \&ld_handler_hup;
+ # Reap Children
+ $SIG{'CHLD'} = \&ld_handler_chld;
+
if (defined $ENV{HOSTNAME}) {
$HOSTNAME = "$ENV{HOSTNAME}";
}
@@ -977,6 +981,21 @@ sub ld_process_hup
&reread_config();
}
+sub ld_handler_chld
+{
+ $DAEMON_CHLD=1;
+}
+
+sub ld_process_chld
+{
+ my $i = 0;
+
+ undef $DAEMON_CHLD;
+ while (waitpid(-1, WNOHANG) > 0) {
+ print "child: $i\n";
+ $i++;
+ }
+}
sub check_signal
{
@@ -986,6 +1005,9 @@ sub check_signal
if (defined $DAEMON_HUP) {
ld_process_hup();
}
+ if (defined $DAEMON_CHLD) {
+ ld_process_chld();
+ }
}
sub reread_config
@@ -2288,7 +2310,6 @@ sub ld_main
check_signal();
} else {
- local $SIG{CHLD} = 'IGNORE';
my @real_checked;
foreach my $v (@VIRTUAL) {
my $real = $$v{real};
@@ -2947,6 +2968,7 @@ sub check_external
my ($v, $r) = @_;
my $result;
my $v_server;
+ my $timed_out = 1;
eval {
local $SIG{'__DIE__'} = "DEFAULT";
@@ -2958,8 +2980,8 @@ sub check_external
} else {
$v_server = $$v{fwm};
}
- $result = system("$$v{checkcommand} $v_server $$v{port} " .
- "$$r{server} $$r{port}");
+ $result = system_wrapper($$v{checkcommand}, $v_server,
+ $$v{port}, $$r{server}, $$r{port});
alarm 0;
$result >>= 8;
};