On Fri, Jun 15, 2012 at 12:15 AM, Mike Roest <msro...@gmail.com> wrote:
> Hey everyone,
> We had an interesting issue happen the other night on one of our
> clusters. A resource attempted to start on an unauthorized node (and
> failed),
Just from the logs below, that does not seem to be the case. What I see
is pacemaker attempting to determine the state of Postgres-Server-1 and
Postgres-IP-1 on dbquorum.example.com, and those operations failing:

> WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0
> on dbquorum.example.com: unknown error (1)
> WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0
> on dbquorum.example.com: unknown error (1)

Under such conditions, pacemaker must assume that the resources are
active and initiate recovery. Your real question should be: why did the
monitor op fail with rc=1 (instead of rc=7) for those two resources on
dbquorum?
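For a custom agent like ocf::custom:pgsql, the usual culprit is a
monitor action that errors out (rc=1) on a node where the software
isn't installed, instead of reporting "not running" (rc=7). The probe
path of the agent should look roughly like this - a minimal sketch,
not your actual agent; the pg_ctl path, the parameter names and the
shellfuncs location are assumptions:

  #!/bin/sh
  # Sourcing location varies between resource-agents versions:
  : ${OCF_ROOT:=/usr/lib/ocf}
  . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

  pgsql_monitor() {
      # A probe on a node that will never run postgres (e.g. a
      # quorum-only VM) must report "not running" (rc=7), never a
      # generic error (rc=1), or pacemaker assumes the resource is
      # active there and starts recovery.
      pgctl="${OCF_RESKEY_pgctl:-/usr/bin/pg_ctl}"
      if [ ! -x "$pgctl" ]; then
          return $OCF_NOT_RUNNING        # rc=7: not even installed
      fi
      if "$pgctl" status -D "${OCF_RESKEY_pgdata:-/var/lib/pgsql/data}" \
              >/dev/null 2>&1; then
          return $OCF_SUCCESS            # rc=0: running
      fi
      return $OCF_NOT_RUNNING            # rc=7: cleanly stopped
  }

If the monitor on dbquorum instead hits an unhandled error (missing
binary, unreadable config, etc.) and exits 1, you get exactly the
"unknown error (1)" entries above, and recovery follows.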
> which caused the real resource, already running on a different
> node, to become orphaned and subsequently shut down.
>
> Some background:
> We're running pacemaker 1.0.12, corosync 1.2.7 on CentOS 5.8 x64.
>
> The cluster has 3 members:
> pgsql1c & pgsql1d are physical machines running dual Xeon X5650s
> with 32 gigs of RAM.
> dbquorum is a VM running on VMware ESX on HP blade hardware.
>
> The 2 physical machines are configured as master/slave postgres
> servers; the VM is only there for quorum - it should never run any
> resources. The full crm configuration is available in this zip (as a
> link to allow the email to post correctly):
> https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip
>
> On the dbquorum VM we got the following log message:
> Jun 07 03:11:10 corosync [TOTEM ] Process pause detected for 598 ms,
> flushing membership messages.
>
> After this it appears that, somehow, even though the
> Cluster-Postgres-Server-1 and Postgres-IP-1 resources are only set up
> to run on pgsql1c/d, the dbquorum box tried to start them:
>
> WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0
> on dbquorum.example.com: unknown error (1)
> WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0
> on dbquorum.example.com: unknown error (1)
> info: find_clone: Internally renamed Postgres-Server-1:0
> on pgsql1c.example.com to Postgres-Server-1:1
> info: find_clone: Internally renamed Postgres-Server-1:1
> on pgsql1d.example.com to Postgres-Server-1:2 (ORPHAN)
> WARN: process_rsc_state: Detected active orphan Postgres-Server-1:2
> running on pgsql1d.example.com
> ERROR: native_add_running: Resource ocf::IPaddr2:Postgres-IP-1 appears
> to be active on 2 nodes.
> WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for
> more information.
> notice: native_print: Postgres-IP-1 (ocf::heartbeat:IPaddr2) Started FAILED
> notice: native_print: 0 : dbquorum.example.com
> notice: native_print: 1 : pgsql1d.example.com
> notice: clone_print: Master/Slave Set: Cluster-Postgres-Server-1
> notice: native_print: Postgres-Server-1:0 (ocf::custom:pgsql):
> Slave dbquorum.example.com FAILED
> notice: native_print: Postgres-Server-1:2 (ocf::custom:pgsql):
> ORPHANED Master pgsql1d.example.com
> notice: short_print: Slaves: [ pgsql1c.example.com ]
> ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1c.example.com] = 100
> ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1d.example.com] = 100
> ERROR: clone_color: Postgres-Server-1:0 is running
> on dbquorum.example.com which isn't allowed
> info: native_color: Stopping orphan resource Postgres-Server-1:2
>
> The stopping of the orphaned resource caused our master to stop;
> luckily the slave was correctly promoted to master and we had no
> outage.
>
> Several things seem to have gone wrong here:
> 1. The VM pause - some searching turned up posts about this pause
> message and VMs. We've upped the priority of our dbquorum box on the
> VM host. The other posts mention the token configuration option in
> totem, but we haven't set that, so it should be at its default of
> 1000 ms; it therefore seems unlikely that changing this setting would
> have made any difference in this situation. We looked at the VM host
> and couldn't see anything on the physical host at the time that would
> have caused this pause.
> 2. The quorum machine tried to start resources it is not authorized
> for - symmetric-cluster is set to false and there is no location
> entry for that node/resource, so why would it try to start them?
> 3. The 2 machines that stayed up got corrupted when the 3rd came back
> - the 2 primary machines never lost quorum, so when the 3rd machine
> came back and told them it was now the postgres master, why would
> they believe it, and then shut down the proper master that they
> should know full well is the true master? I would have expected the
> dbquorum machine's changes to be rejected by the other 2, which had
> quorum.
>
> The logs and config are in this zip:
> https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip
> pgsql1d was the DC at the time of the issue.
>
> If anyone has any ideas as to why this happened and/or changes we can
> make to our config to prevent it happening again, that would be
> great.
>
> Thanks!
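Two follow-ups on your numbered points:

Regarding 2: even with symmetric-cluster=false, pacemaker still runs a
one-time probe (the _monitor_0 op in your logs) on every node,
precisely to verify that a resource is *not* already running there -
location constraints control where a resource may run, not where it is
probed. So dbquorum never actually tried to start anything; its probes
failed in a way pacemaker had to treat as "running". That said, you can
make the prohibition explicit with -inf constraints, along these lines
(the constraint IDs here are made up, adjust to your config):

  location No-Postgres-On-Quorum Cluster-Postgres-Server-1 -inf: dbquorum.example.com
  location No-Postgres-IP-On-Quorum Postgres-IP-1 -inf: dbquorum.example.com

Regarding 1: if the VM pauses recur, raising the totem token timeout in
corosync.conf buys the VM more headroom before membership is affected.
Something like this (5000 is just an illustrative value, in ms):

  totem {
      version: 2
      # default is 1000 ms; a briefly paused VM can miss that window
      token: 5000
      # ...your existing totem settings unchanged...
  }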