Hello,

On Thu, Aug 09, 2007 at 04:34:58PM +0200, Andreas Kurz wrote:
> Hello all,
> 
> I've tested an upgrade from Heartbeat 2.1.0 to 2.1.2 on CentOS 4.5
> with a two-node cluster. With version 2.1.2 a lot of messages like the
> following occurred in the logs:
> 
> lrmd[17208]: 2007/08/02_01:57:15 WARN: G_SIG_dispatch: Dispatch
> function for SIGCHLD was delayed 1010 ms (> 100 ms) before being
> called (GSource: 0x9ba62d8)
> ....
> lrmd[17208]: 2007/08/04_23:41:29 WARN: G_SIG_dispatch: Dispatch
> function for SIGCHLD took too long to execute: 50 ms (> 30 ms)
> (GSource: 0x9ba62d8)

I still can't say why this is happening. The regularity of the
times is strange, as is the 1 s delay for the dispatch.
Perhaps Alan can give a hint. Please file a bugzilla for this.

> I killed the 'lrmd' process on this test system on one node (not the
> DC) and it was respawned as expected and the log messages disappeared
> on this node (attached logs from 2007-08-06). After a while I realized
> that the local resource monitoring was not working.

The messages stopped because the respawned lrmd had no resources
and therefore no child processes, so there was no SIGCHLD to
dispatch.

> I executed a
> 'crm_resource -P' and messages like this appeared in the DC logs
> (attached logs from 2007-08-08):
> 
> tengine[10535]: 2007/08/08_10:17:17 WARN: action_timer_callback:
> Timer popped (abort_level=1000000, complete=false)
> tengine[10535]: 2007/08/08_10:17:17 WARN: print_elem: Action missed
> its timeout[Action 11]: In-flight (id: DoFencing:0_monitor_0, loc:
> smsdb05, priority: 0)
> ...
> tengine[10535]: 2007/08/08_10:18:43 info: unconfirmed_actions: Action
> DoFencing:0_monitor_0 11 unconfirmed from peer
> 
> The only way to resolve this issue was a restart of heartbeat on the
> node with the stuck lrmd, and a simple "/etc/init.d/heartbeat stop"
> ran for more than 20 minutes without success. Does anyone have an
> idea what's wrong?

Restarting lrmd should result in a restart of crmd, but that
didn't happen. crmd on the node smsdb05 continued to run and,
judging by the logs on smsdb06, even accepted commands from the
DC, but never ran the requested operations. That's why all
actions were unconfirmed: crmd on smsdb05 was in limbo. I can't
say why. Perhaps Andrew could, though I didn't see any messages
in the logs. I suspect that's also the reason for the failed
shutdown. This should also get a bugzilla entry.

BTW, if you think that lrmd "gets stuck", you can list the
resources and their status using lrmadmin -L.

The pingd clone should have the globally_unique attribute set to
false.
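
In case it helps, a rough sketch of where that attribute would sit
in the CIB, assuming the Heartbeat 2.x clone syntax. All ids below
are hypothetical; keep the ones from your own configuration and
check the result against the DTD shipped with your version:

```xml
<!-- Sketch only: every id value here is made up for illustration -->
<clone id="pingd-clone">
  <instance_attributes id="pingd-clone-attrs">
    <attributes>
      <nvpair id="pingd-clone-globally-unique"
              name="globally_unique" value="false"/>
    </attributes>
  </instance_attributes>
  <primitive id="pingd-child" class="ocf" provider="heartbeat" type="pingd"/>
</clone>
```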

You should also enable coredumps in ha.cf. And, if this is a test
cluster, turn debugging on so that we can get a better idea of
what's going on.
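
A minimal sketch of the corresponding ha.cf lines, assuming the
stock directive names. The debug level of 1 is only an example
value, and heartbeat has to be restarted to pick up the change:

```
# ha.cf fragment (sketch)
# Allow heartbeat and its child processes to dump core
coredumps true
# Example debug level; higher values are more verbose
debug 1
```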

Thanks for the report.

Dejan

> Regards,
> Andreas


> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems