Hey everyone,

We had an interesting issue happen the other night on one of our clusters. A resource attempted to start on an unauthorized node (and failed), which caused the real resource, already running on a different node, to become orphaned and subsequently shut down.
Some background: we're running pacemaker 1.0.12 and corosync 1.2.7 on CentOS 5.8 x64. The cluster has 3 members: pgsql1c & pgsql1d are physical machines running dual Xeon X5650s with 32 GB of RAM; dbquorum is a VM running on a VMware ESX server on HP blade hardware. The 2 physical machines are configured as master/slave postgres servers; the VM is only there for quorum - it should never run any resources. The full crm configuration is available in this zip (as a link to allow the email to post correctly): https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip

On the dbquorum VM we got the following log message:

Jun 07 03:11:10 corosync [TOTEM ] Process pause detected for 598 ms, flushing membership messages.

After this, it appears that even though the Cluster-Postgres-Server-1 and Postgres-IP-1 resources are only set up to run on pgsql1c/d, the dbquorum box somehow tried to start them up:

WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0 on dbquorum.example.com: unknown error (1)
WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0 on dbquorum.example.com: unknown error (1)
info: find_clone: Internally renamed Postgres-Server-1:0 on pgsql1c.example.com to Postgres-Server-1:1
info: find_clone: Internally renamed Postgres-Server-1:1 on pgsql1d.example.com to Postgres-Server-1:2 (ORPHAN)
WARN: process_rsc_state: Detected active orphan Postgres-Server-1:2 running on pgsql1d.example.com
ERROR: native_add_running: Resource ocf::IPaddr2:Postgres-IP-1 appears to be active on 2 nodes.
WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
notice: native_print: Postgres-IP-1 (ocf::heartbeat:IPaddr2) Started FAILED
notice: native_print: 0 : dbquorum.example.com
notice: native_print: 1 : pgsql1d.example.com
notice: clone_print: Master/Slave Set: Cluster-Postgres-Server-1
notice: native_print: Postgres-Server-1:0 (ocf::custom:pgsql): Slave dbquorum.example.com FAILED
notice: native_print: Postgres-Server-1:2 (ocf::custom:pgsql): ORPHANED Master pgsql1d.example.com
notice: short_print: Slaves: [ pgsql1c.example.com ]
ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1c.example.com] = 100
ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1d.example.com] = 100
ERROR: clone_color: Postgres-Server-1:0 is running on dbquorum.example.com which isn't allowed
info: native_color: Stopping orphan resource Postgres-Server-1:2

The stopping of the orphaned resource caused our master to stop; luckily the slave was correctly promoted to master and we had no outage.

There seem to be several things that went wrong here:

1. The VM pause - searching around, I found some posts about this pause message and VMs, so we've raised the priority of our dbquorum box on the VM host. The other posts talk about the token option in totem, but we haven't set that, so it should be at its default of 1000 ms, and it doesn't seem likely that changing it would have made any difference in this situation. We also looked at the VM host and couldn't see anything on the physical hardware at the time that would explain the pause. (The corosync snippet I was considering is in the P.S. below.)

2. The quorum machine tried to start resources it is not authorized for - symmetric-cluster is set to false and there is no location entry for that node/resource, so why would it try to start them? (A sketch of the constraints I mean is also in the P.S.)
3. The 2 machines that stayed up got their state corrupted when the 3rd came back - the 2 primary machines never lost quorum, so when the 3rd machine came back and told them it was now the postgres master, why would they believe it, and then shut down the proper master that they should know full well is the true master? I would have expected the dbquorum machine's changes to have been rejected by the other 2, which had quorum.

The logs and config are in this zip: https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip (pgsql1d was the DC at the time of the issue). If anyone has any ideas as to why this happened, and/or changes we can make to our config to prevent it happening again, that would be great.

Thanks!
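P.S. A couple of sketches to make the questions above concrete. These are not what's deployed - the real config is in the zip - just illustrations.

For point 1, the only corosync-side change I can see is raising the totem token timeout in /etc/corosync/corosync.conf above its 1000 ms default, along these lines (5000 is just an example value, not a recommendation):

    totem {
            version: 2
            # token timeout in milliseconds; corosync's default is 1000
            token: 5000
            # (rest of our existing totem/interface settings unchanged)
    }

As noted above, though, I'm not convinced a longer token would actually have helped with a 598 ms process pause.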
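For point 2, my understanding is that with symmetric-cluster="false" placement is opt-in only, i.e. the resources should only ever be allowed where a positive location rule exists, roughly like this (the constraint names here are made up for illustration; the real ones are in the attached config):

    location loc-pgsql-1c Cluster-Postgres-Server-1 100: pgsql1c.example.com
    location loc-pgsql-1d Cluster-Postgres-Server-1 100: pgsql1d.example.com
    location loc-pgip-1c Postgres-IP-1 100: pgsql1c.example.com
    location loc-pgip-1d Postgres-IP-1 100: pgsql1d.example.com

with no rule at all for dbquorum.example.com. Would adding explicit -inf rules for dbquorum, e.g.

    location no-pgsql-on-quorum Cluster-Postgres-Server-1 -inf: dbquorum.example.com
    location no-pgip-on-quorum Postgres-IP-1 -inf: dbquorum.example.com

be expected to change anything here, or is that redundant in an opt-in cluster?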
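Also, regarding the failed _monitor_0 ops in the log excerpt: those look like the one-off probes, and for the pgsql one at least I'm wondering whether our custom agent should be returning OCF_NOT_RUNNING (7) rather than a generic error (1) when it's probed on a node where postgres isn't present at all. Something like this hypothetical excerpt (not our actual agent; the path and function name are just for illustration, and it assumes the standard ocf-shellfuncs return codes are sourced):

    pgsql_monitor() {
        # On a node with no postgres installation (e.g. dbquorum), a probe
        # should report "not running" (rc 7) instead of a generic error (rc 1).
        if [ ! -x /usr/bin/pg_ctl ]; then
            return $OCF_NOT_RUNNING
        fi
        # ... the normal running/master status checks go here ...
    }

Could a probe returning rc 1 instead of rc 7 be part of what confused the PE here?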
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org