On Tue, Mar 30, 2010 at 6:48 PM, Steven Dake <[email protected]> wrote:
> On Tue, 2010-03-30 at 11:43 +0200, Colin wrote:
>>
>> we are running Corosync 1.2.0-0ubuntu1 on Ubuntu 10.4 beta w/current
>> updates; the cluster consists of two systems running in KVM, each on a
>> dedicated host.
>>
>> We have observed several times, but are unfortunately unable to nail
>> the exact cause, that when the virtualised system that is running
>> corosync has a "hiccup", i.e. hangs for couple of seconds when we
>> introduce a delay into its storage access, then the corosync process
>> enters an endless loop from which it doesn't ever seem to recover.
>>
>> In this endless loop the process uses 193% CPU in the 2-CPU
>> virtualised system, and is issuing a stream of wait4() system-calls
>> (with an occasional nanosleep() and some futex-stuff).
>
> If you could explain how you delay your vm for a short period, I could
> debug.
>
> One thing you can try is to increase your token timeout (the token field
> in the totem{} directive). At the moment it is set to 1000 msec (1
> second), but that may not be suitable for some virtualized environments.

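For reference, the token change suggested above would look roughly like the fragment below in corosync.conf (typically /etc/corosync/corosync.conf). This is only a sketch: the value 5000 is an arbitrary illustration, not a figure recommended anywhere in this thread, and the other settings in the totem section are left out and should stay as they are.

```
totem {
        # Token timeout in milliseconds. The thread mentions the current
        # value is 1000; 5000 here is an assumed example value only.
        token: 5000
}
```

Corosync has to be restarted on both nodes for the new value to take effect.
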
Typical Murphy's law: now that I'm trying to get a stack trace, I'm having trouble reproducing the problem, even though it occurred several times previously (perhaps because I installed the debug build?).

Anyhow, I'm not worried about the cluster losing connection when the VM hiccups; what is disconcerting is that it doesn't recover afterwards, because the corosync process is stuck in an endless loop.

Regards, Colin
