I don't think so, as I have other similar clusters on the same network and didn't have any issues. The only thing I could detect was that the virtual machine was unresponsive. But I think the VM crash was not like a power shutdown; it was more like the VM became very slow and then totally crashed.
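For what it's worth, a node that goes "very slow then crashes" (rather than powering off cleanly) is exactly the case where only fencing can guarantee a failover, because the hung node may still hold DRBD and the filesystem. A minimal sketch with the crm shell, assuming the fence-agents package's fence_vmware_soap agent for ESXi guests; the device name, host, and credentials below are placeholders, not taken from this thread:

```shell
# Hypothetical sketch: enable STONITH so a frozen VM is forcibly
# power-cycled through the ESXi host instead of blocking failover.
# All names/credentials are made-up placeholders.
crm configure primitive st-vmware stonith:fence_vmware_soap \
    params ipaddr=esxi-host.example.com login=fenceuser passwd=secret \
    op monitor interval=60s
crm configure property stonith-enabled=true
```

Without working fencing, Pacemaker cannot safely take over resources from a node it can still partially see, which may explain the long delay between the hang and the recovery attempt.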
Even if the drbd-nagios resource monitor times out, it should fail over to the other node, shouldn't it?

Regards,

On 20 February 2012 12:35, Andrew Beekhof <and...@beekhof.net> wrote:
> On Mon, Feb 13, 2012 at 9:57 PM, Hugo Deprez <hugo.dep...@gmail.com> wrote:
> > Hello,
> >
> > does anyone have an idea ?
>
> Well I see:
>
> Feb 8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)
> Feb 8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback: Resource update 415 failed: (rc=-41) Remote node did not respond
> Feb 8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch: Membership 128: quorum lost
>
> which looks suspicious. Network problem?
>
> > It seems that at 13:06:38 resources get started on the slave member.
> > But then there is something wrong on server01 :
> >
> > Feb 8 13:06:39 server01 pengine: [19469]: info: determine_online_status: Node server01 is online
> > Feb 8 13:06:39 server01 pengine: [19469]: notice: unpack_rsc_op: Operation apache2_monitor_0 found resource apache2 active on server01
> > Feb 8 13:06:39 server01 pengine: [19469]: notice: group_print: Resource Group: supervision-grp
> > Feb 8 13:06:39 server01 pengine: [19469]: notice: native_print:     fs-data (ocf::heartbeat:Filesystem): Stopped
> > Feb 8 13:06:39 server01 pengine: [19469]: notice: native_print:     nagios-ip (ocf::heartbeat:IPaddr2): Stopped
> > Feb 8 13:06:39 server01 pengine: [19469]: notice: native_print:     apache2 (ocf::heartbeat:apache): Started server01
> > Feb 8 13:06:39 server01 pengine: [19469]: notice: native_print:     nagios (lsb:nagios3): Stopped
> >
> > But I don't understand what fails: whether it is DRBD or apache2 that causes the issue.
> >
> > Any idea ?
> >
> > On 10 February 2012 09:39, Hugo Deprez <hugo.dep...@gmail.com> wrote:
> >> Hello,
> >>
> >> please find attached to this mail the corosync logs.
> >> If you have any tips :)
> >>
> >> Regards,
> >>
> >> Hugo
> >>
> >> On 8 February 2012 15:39, Florian Haas <flor...@hastexo.com> wrote:
> >>> On Wed, Feb 8, 2012 at 2:29 PM, Hugo Deprez <hugo.dep...@gmail.com> wrote:
> >>> > Dear community,
> >>> >
> >>> > I am currently running different corosync / drbd clusters using VMs running on a vmware esxi host.
> >>> > The guest OS is Debian Squeeze.
> >>> >
> >>> > The active member of the cluster just froze; the VM was unreachable.
> >>> > But the resources didn't manage to move to the other node.
> >>> >
> >>> > My cluster has the following resources :
> >>> >
> >>> > Resource Group: grp
> >>> >     fs-data (ocf::heartbeat:Filesystem):
> >>> >     nagios-ip (ocf::heartbeat:IPaddr2):
> >>> >     apache2 (ocf::heartbeat:apache):
> >>> >     nagios (lsb:nagios3):
> >>> >     pnp (lsb:npcd):
> >>> >
> >>> > I am currently troubleshooting this issue. I don't really know where to look. Of course I had a look at the logs, but it is pretty hard for me to understand what happened.
> >>>
> >>> It's pretty hard for anyone else to understand _without_ logs. :)
> >>>
> >>> > I noticed that the VM crashed at 12:09 and that the cluster only tried to move the resources at 12:58; this does not make sense to me. Or maybe the host wasn't totally down ?
> >>> >
> >>> > Do you have any idea how I can troubleshoot ?
> >>>
> >>> Log analysis is where I would start.
> >>>
> >>> > Last thing: I noticed that if I start apache2 on the slave server, corosync doesn't detect that the resource is started. Could that be an issue ?
> >>>
> >>> Sure it could, but Pacemaker should happily recover from that.
> >>>
> >>> Cheers,
> >>> Florian
> >>>
> >>> --
> >>> Need help with High Availability?
> >>> http://www.hastexo.com/now
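As a starting point for the log analysis Florian suggests, the cluster and membership state can be snapshotted with the standard tools; a sketch (exact options vary between Pacemaker/corosync versions, and these obviously need a running cluster):

```shell
# One-shot snapshot of resource state as Pacemaker currently sees it
crm_mon -1

# Ring/membership status of corosync (useful after a "quorum lost" message)
corosync-cfgtool -s

# Re-probe and clear stale state for a resource that was started by hand,
# e.g. apache2 started manually on the slave node
crm_resource --cleanup --resource apache2

# Sanity-check the live configuration for errors
crm_verify -L -V
```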
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org