On Tue, Jul 24, 2012 at 11:13 PM, <fatcha...@gmx.de> wrote:
> Hi,
>
> here are the results of the corosync status. Can't find a problem there:
>
> pilotpound:
>
> [root@pilotpound ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 425699520
> RING ID 0
>         id      = 192.168.95.25
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.20.245
>         status  = ring 1 active with no faults
> [root@pilotpound ~]# corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.425699520.ip=r(0) ip(192.168.95.25) r(1) ip(192.168.20.245)
> runtime.totem.pg.mrp.srp.members.425699520.join_count=1
> runtime.totem.pg.mrp.srp.members.425699520.status=joined
> runtime.totem.pg.mrp.srp.members.442476736.ip=r(0) ip(192.168.95.26) r(1) ip(192.168.20.246)
> runtime.totem.pg.mrp.srp.members.442476736.join_count=1
> runtime.totem.pg.mrp.srp.members.442476736.status=joined
>
>
> powerpound:
>
> [root@powerpound ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 442476736
> RING ID 0
>         id      = 192.168.95.26
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.20.246
>         status  = ring 1 active with no faults
> [root@powerpound ~]# corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.442476736.ip=r(0) ip(192.168.95.26) r(1) ip(192.168.20.246)
> runtime.totem.pg.mrp.srp.members.442476736.join_count=1
> runtime.totem.pg.mrp.srp.members.442476736.status=joined
> runtime.totem.pg.mrp.srp.members.425699520.ip=r(0) ip(192.168.95.25) r(1) ip(192.168.20.245)
> runtime.totem.pg.mrp.srp.members.425699520.join_count=5
> runtime.totem.pg.mrp.srp.members.425699520.status=joined
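[Editor's aside: the membership check quoted above can be scripted so both nodes can be compared at a glance. This is a sketch of my own (`check_members` is not a corosync tool); on a live node you would pipe `corosync-objctl | grep members` into it instead of the embedded sample, which is copied from the pilotpound output above.]

```shell
#!/bin/sh
# Flag any corosync member whose runtime status is not "joined".
check_members() {
    awk -F'=' '
        # Keys look like: runtime.totem.pg.mrp.srp.members.<nodeid>.status
        $1 ~ /\.members\..*\.status$/ {
            split($1, parts, ".")
            node = parts[7]                 # numeric node ID
            if ($2 != "joined") bad[node] = $2
            total++
        }
        END {
            for (n in bad) { printf "node %s: %s\n", n, bad[n]; rc = 1 }
            if (rc != 1) print "all " total " members joined"
            exit rc
        }'
}

# Sample input taken from the pilotpound output quoted above;
# on a real node: corosync-objctl | grep members | check_members
check_members <<'EOF'
runtime.totem.pg.mrp.srp.members.425699520.status=joined
runtime.totem.pg.mrp.srp.members.442476736.status=joined
EOF
```

The exit code is nonzero when any member is not joined, so the function can gate a monitoring check as well as be read by eye.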
That is almost certainly the two bugs Jake pointed out. The good news is that upstream got to the bottom of the problem and it is now fixed.

> So I think I've got to swallow the bitter pill and restart the whole cluster.
>
> I will report back with the result.
>
> Kind regards
>
> fatcharly
>
>
> -------- Original Message --------
>> Date: Fri, 20 Jul 2012 12:21:47 -0400 (EDT)
>> From: Jake Smith <jsm...@argotec.com>
>> To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
>> Subject: Re: [Pacemaker] problem with pacemaker/corosync on CentOS 6.3
>
>>
>> ----- Original Message -----
>> > From: fatcha...@gmx.de
>> > To: "Jake Smith" <jsm...@argotec.com>, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> > Sent: Friday, July 20, 2012 11:50:52 AM
>> > Subject: Re: [Pacemaker] problem with pacemaker/corosync on CentOS 6.3
>> >
>> > Hi Jake,
>> >
>> > I erased the files as mentioned and started the services. This is
>> > what I get on pilotpound after crm_mon:
>> >
>> > ============
>> > Last updated: Fri Jul 20 17:45:58 2012
>> > Last change:
>> > Current DC: NONE
>> > 0 Nodes configured, unknown expected votes
>> > 0 Resources configured.
>> > ============
>> >
>> >
>> > Looks like the node didn't join the cluster.
>> >
>> > Any suggestions are welcome
>>
>> Oh, maybe worth checking corosync membership and seeing what it says now:
>> http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership
>>
>> >
>> > Kind regards
>> >
>> > fatcharly
>> >
>> > -------- Original Message --------
>> > > Date: Fri, 20 Jul 2012 10:49:15 -0400 (EDT)
>> > > From: Jake Smith <jsm...@argotec.com>
>> > > To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
>> > > Subject: Re: [Pacemaker] problem with pacemaker/corosync on CentOS 6.3
>> >
>> > >
>> > > ----- Original Message -----
>> > > > From: fatcha...@gmx.de
>> > > > To: pacemaker@oss.clusterlabs.org
>> > > > Sent: Friday, July 20, 2012 6:08:45 AM
>> > > > Subject: [Pacemaker] problem with pacemaker/corosync on CentOS 6.3
>> > > >
>> > > > Hi,
>> > > >
>> > > > I'm using a pacemaker+corosync bundle to run a pound-based
>> > > > loadbalancer. After an update to CentOS 6.3 there is a mismatch
>> > > > in the node status. Via crm_mon, on one node everything looks
>> > > > fine, while on the other node everything is offline. Everything
>> > > > was fine on CentOS 6.2.
>> > > >
>> > > > Node powerpound:
>> > > >
>> > > > ============
>> > > > Last updated: Fri Jul 20 12:04:29 2012
>> > > > Last change: Thu Jul 19 17:58:31 2012 via crm_attribute on pilotpound
>> > > > Stack: openais
>> > > > Current DC: powerpound - partition with quorum
>> > > > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
>> > > > 2 Nodes configured, 2 expected votes
>> > > > 7 Resources configured.
>> > > > ============
>> > > >
>> > > > Online: [ powerpound pilotpound ]
>> > > >
>> > > > HA_IP_1 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_2 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_3 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_4 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_5 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > Clone Set: pingclone [ping-gateway]
>> > > >     Started: [ pilotpound powerpound ]
>> > > >
>> > > >
>> > > > Node pilotpound:
>> > > >
>> > > > ============
>> > > > Last updated: Fri Jul 20 12:04:32 2012
>> > > > Last change: Thu Jul 19 17:58:17 2012 via crm_attribute on pilotpound
>> > > > Stack: openais
>> > > > Current DC: NONE
>> > > > 2 Nodes configured, 2 expected votes
>> > > > 7 Resources configured.
>> > > > ============
>> > > >
>> > > > OFFLINE: [ powerpound pilotpound ]
>> > > >
>> > > >
>> > > > from /var/log/messages on pilotpound:
>> > > >
>> > > > Jul 20 12:06:12 pilotpound cib[24755]: warning: cib_peer_callback:
>> > > > Discarding cib_apply_diff message (35909) from powerpound: not in
>> > > > our membership
>> > > > Jul 20 12:06:12 pilotpound cib[24755]: warning: cib_peer_callback:
>> > > > Discarding cib_apply_diff message (35910) from powerpound: not in
>> > > > our membership
>> > > >
>> > > >
>> > > > How could this have happened, and what can I do to solve this problem?
>> > >
>> > > Pretty sure it had nothing to do with the upgrade - I had this the
>> > > other day on Ubuntu 12.04 after a reboot of both nodes. I believe a
>> > > couple of experts called it a "transient" bug. See:
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=820821
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=5040
>> > >
>> > > >
>> > > > Any suggestions are welcome
>> > >
>> > > I fixed it by stopping/killing pacemaker/corosync on the offending
>> > > node (pilotpound).
Then cleared these files out on the same node:
>> > > rm /var/lib/heartbeat/crm/cib*
>> > > rm /var/lib/pengine/*
>> > >
>> > > Then restarted corosync/pacemaker and the node rejoined fine.
>> > >
>> > > HTH
>> > >
>> > > Jake
>> > >
>> > > _______________________________________________
>> > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> > >
>> > > Project Home: http://www.clusterlabs.org
>> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > > Bugs: http://bugs.clusterlabs.org
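[Editor's aside: Jake's recovery procedure can be collected into a small script. This is a sketch of my own (`reset_node_state`, `CRM_DIR`, and `PE_DIR` are hypothetical names, not anything from the thread); only the two `rm` commands and the service names come from the messages above, and the paths default to the CentOS 6 locations quoted there. The deletions are destructive, so the directories are overridable to allow a rehearsal on scratch copies.]

```shell
#!/bin/sh
# Sketch of the recovery Jake describes: stop the stack on the stuck
# node, remove its cached CIB and policy-engine files, then restart
# so the node pulls a fresh CIB from its peer when it rejoins.
: "${CRM_DIR:=/var/lib/heartbeat/crm}"   # CIB location on CentOS 6
: "${PE_DIR:=/var/lib/pengine}"          # policy-engine files

reset_node_state() {
    # Stop the cluster stack (errors ignored so a rehearsal off-node works).
    service pacemaker stop  >/dev/null 2>&1 || true
    service corosync  stop  >/dev/null 2>&1 || true
    # The two rm commands from Jake's mail:
    rm -f "$CRM_DIR"/cib* "$PE_DIR"/*
    # Restart; the node should rejoin and resync its CIB.
    service corosync  start >/dev/null 2>&1 || true
    service pacemaker start >/dev/null 2>&1 || true
}
```

On the broken node one would simply run `reset_node_state` as root; to rehearse safely first, point `CRM_DIR` and `PE_DIR` at scratch directories before calling it.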