On Fri, 2012-08-10 at 12:21 +1000, Andrew Beekhof wrote:
> On Thu, Aug 9, 2012 at 12:14 PM, Bob Haxo <bhaxo at sgi.com> wrote:
> > Greetings.
> >
> > I have followed the setup instructions of Clusters From Scratch:
> > Creating Active/Passive and Active/Active Clusters on Fedora, Edition 5,
> > including locating the new cman pages that do not seem to be linked into
> > the main document, for example,
> >
> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02s02.html
>
> The 1.1 document was updated for corosync 2.x
> I kept the cman/plugin version around but moved it to:
>
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/index.html
>
> Look for "Version: 1.1-plugin" on the main docs page.
Andrew, much thanks for the response ... and much thanks here ... I had
not connected the dots regarding use of cman being an *earlier* version
of the docs (and software stack).

> >
> > The stack that I'm implementing includes RHEL6.3, drbd, dlm, gfs2,
> > Pacemaker (RHEL6.3 build), cman, kvm ... hopefully I didn't leave
> > anybody off the party list.
> >
> > I have these all working together to support "live" migration of the
> > virt client between the two phys hosts, so at that level, all is good.
> >
> > Questions: Is there a document that fully covers such an
> > installation, meaning one that extends Clusters From Scratch (and
> > replaces the Apache example) to implementation of a HA virtual client?
> > For instance, should libvirtd be handled as a Pacemaker resource, or
> > should it be started as a system service at boot? What should be done
> > with "libvirt-guests"?
>
> These things I do not know sorry.
>
> > Should cman be started as a system service at boot?
>
> I prefer not to, but it's just a personal preference.
> I run potentially broken versions of the cluster and have been hit
> hard before with processes running amok and putting machines into
> reboot cycles.

Ah, right. I too in my testing start cman and pacemaker manually. I was
thinking more of when moving from testing to production. I think you have
answered that.

> >
> > Problem: When the non-VM-host is rebooted, then when Pacemaker
> > restarts, the gfs2 filesystem gets restarted on the VM host, which
> > causes the stop and start of the VirtualDomain. The gfs2 filesystem
> > also gets restarted without the VirtualDomain resource included.
>
> This sounds like the "starting a clone on A causes a restart of the
> clone on B" bug.
> I think we've squashed that one now but not in a released version...
> how confident are you at creating rpms? :-)

Well "how confident" depends upon the precise meaning of "creating rpms" ..
if this is building an rpm given a working spec file, then that I can do.
If it is a matter of making mods to an almost working spec file, that I
can do. If it involves creating the spec file from scratch for a large
project, that would be a challenge.

FYI, I'm trying to get Pacemaker accepted for use in a product rather
than rgmanager.

Thanks, Andrew.
Bob Haxo
bhaxo at sgi.com

> >
> > This behavior does not seem correct ... I think I would have flagged it
> > in my memory if I'd encountered the behavior when working with the SLES
> > HAE product. I've been doing a lot of fumbling this past week trying to
> > get the colocation and order statements correct, without affecting this
> > behavior.
> >
> > What am I missing?
> >
> > Here are the first indications of this restart issue during the restart
> > of Pacemaker and friends with the boot. I have attached more messages.
> >
> > Aug 8 20:00:57 hikari crmd[2734]: info: abort_transition_graph:
> > te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
> > id=status-hikari2-master-drbd_r0.1, name=master-drbd_r0:1, value=5,
> > magic=NA, cib=0.474.170) : Transient attribute: update
> > Aug 8 20:00:57 hikari crmd[2734]: notice: do_state_transition: State
> > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> > origin=abort_transition_graph ]
> > Aug 8 20:00:57 hikari pengine[2733]: notice: unpack_config: On loss of
> > CCM Quorum: Ignore
> > Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Promote
> > drbd_r0:1#011(Slave -> Master hikari2)
> > Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Restart
> > virt#011(Started hikari) <<<<<<<<<<<<<<<<<<
> > Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Restart
> > shared-gfs2:0#011(Started hikari) <<<<<<<<
> > Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Start
> > shared-gfs2:1#011(hikari2)
> > Aug 8 20:00:57 hikari crmd[2734]: info: abort_transition_graph:
> > te_update_diff:176 - Triggered
transition abort (complete=1, tag=nvpair,
> > id=status-hikari2-master-drbd_r1.1, name=master-drbd_r1:1, value=5,
> > magic=NA, cib=0.474.171) : Transient attribute: update
> >
> > Here are the current constraints resulting from fumbling (actually,
> > trying to make sense of all of the information obtained from Google
> > searches):
> >
> > colocation co-gfs-on-drbd inf: c_shared-gfs2 drbd_r0_clone:Master
> > order o-drbd_r0-then-gfs inf: drbd_r0_clone:promote c_shared-gfs2:start
> > order o-drbd_r1_clone-then-virt inf: drbd_r1_clone virt
> > order o-gfs-then-virt inf: c_shared-gfs2 virt
> >
> > Full config file attached.
> >
> > For reference, here is "service blah status" for the set of services:
> >
> > [root@hikari2 ~]# ha-status
> > ------- service corosync status -------
> > corosync (pid 1996) is running...
> > ------- service cman status -------
> > cluster is running.
> > ------- service drbd status -------
> > drbd driver loaded OK; device status:
> > version: 8.4.1 (api:1/proto:86-100)
> > GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
> > phil@Build64R6, 2012-04-17 11:28:08
> > m:res  cs         ro               ds                 p  mounted  fstype
> > 1:r0   Connected  Primary/Primary  UpToDate/UpToDate  C  /shared  gfs2
> > 2:r1   Connected  Primary/Primary  UpToDate/UpToDate  C
> > 3:r2   Connected  Primary/Primary  UpToDate/UpToDate  C
> > ------- service pacemaker status -------
> > pacemakerd (pid 8912) is running...
> > ------- service gfs2 status -------
> > Configured GFS2 mountpoints:
> > /shared
> > Active GFS2 mountpoints:
> > /shared
> > ------- service libvirtd status -------
> > libvirtd (pid 2510) is running...
> >
> > [root@hikari ~]# crm_mon -1ro
> > ============
> > Last updated: Wed Aug 8 21:01:47 2012
> > Last change: Wed Aug 8 20:48:49 2012 via cibadmin on hikari
> > Stack: cman
> > Current DC: hikari - partition with quorum
> > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> > 2 Nodes configured, 2 expected votes
> > 11 Resources configured.
> > ============
> >
> > Online: [ hikari hikari2 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: drbd_r0_clone [drbd_r0]
> >      Masters: [ hikari hikari2 ]
> >  Master/Slave Set: drbd_r1_clone [drbd_r1]
> >      Masters: [ hikari hikari2 ]
> >  Master/Slave Set: drbd_r2_clone [drbd_r2]
> >      Masters: [ hikari hikari2 ]
> >  ipmi-fencing-1 (stonith:fence_ipmilan): Started hikari
> >  ipmi-fencing-2 (stonith:fence_ipmilan): Started hikari2
> >  virt (ocf::heartbeat:VirtualDomain): Started hikari
> >  Clone Set: c_shared-gfs2 [shared-gfs2]
> >      Started: [ hikari hikari2 ]
> >
> > Operations:
> > * Node hikari2:
> >    drbd_r1:1: migration-threshold=1000000
> >     + (17) monitor: interval=60000ms rc=0 (ok)
> >     + (26) promote: rc=0 (ok)
> >    drbd_r0:1: migration-threshold=1000000
> >     + (21) promote: rc=0 (ok)
> >    drbd_r2:1: migration-threshold=1000000
> >     + (19) monitor: interval=60000ms rc=0 (ok)
> >     + (27) promote: rc=0 (ok)
> >    ipmi-fencing-2: migration-threshold=1000000
> >     + (12) start: rc=0 (ok)
> >     + (13) monitor: interval=240000ms rc=0 (ok)
> >    shared-gfs2:1: migration-threshold=1000000
> >     + (25) start: rc=0 (ok)
> > * Node hikari:
> >    drbd_r1:0: migration-threshold=1000000
> >     + (24) promote: rc=0 (ok)
> >    drbd_r2:0: migration-threshold=1000000
> >     + (25) promote: rc=0 (ok)
> >    shared-gfs2:0: migration-threshold=1000000
> >     + (92) start: rc=0 (ok)
> >    drbd_r0:0: migration-threshold=1000000
> >     + (23) promote: rc=0 (ok)
> >    ipmi-fencing-1: migration-threshold=1000000
> >     + (12) start: rc=0 (ok)
> >     + (13) monitor: interval=240000ms rc=0 (ok)
> >    virt: migration-threshold=1000000
> >     + (120) start: rc=0 (ok)
> >     + (121) monitor: interval=10000ms rc=0 (ok)
> >
> > Thanks for reading ...
> > Bob Haxo
> > bhaxo @ sgi.com
> >
> > _______________________________________________
> > Pacemaker mailing list: [email protected]
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
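
The pengine "LogActions" entries quoted in the message above are what reveal the unwanted restarts of virt and shared-gfs2:0. When the full log is long, a small script can pull out just those entries. The following is a minimal sketch, not part of the original thread. It assumes the log has one entry per line (not wrapped as in the mail) and uses the line layout seen in the Pacemaker 1.1.7 excerpts here; other versions may log differently. The function names and the regular expression are illustrative only.

```python
import re

# Matches the per-resource summary lines logged by pengine, e.g.
#   "... pengine[2733]: notice: LogActions: Restart virt#011(Started hikari)"
# "#011" is the syslog escape for the tab between resource name and state.
# Assumption: this layout is taken from the excerpts in this thread only.
LOG_ACTION = re.compile(
    r"pengine\[\d+\]:\s+notice:\s+LogActions:\s+"
    r"(?P<action>\w+)\s+(?P<resource>[^\s#(]+)(?:#011|\s)*\((?P<detail>[^)]*)\)"
)


def scheduled_actions(lines):
    """Yield (action, resource, detail) for every LogActions line found."""
    for line in lines:
        match = LOG_ACTION.search(line)
        if match:
            yield match.group("action"), match.group("resource"), match.group("detail")


def restarted_resources(lines):
    """Return the names of resources the policy engine scheduled for Restart."""
    return [res for action, res, _ in scheduled_actions(lines) if action == "Restart"]


# Sample entries copied from the thread, joined back onto single lines.
SAMPLE = [
    "Aug 8 20:00:57 hikari pengine[2733]: notice: unpack_config: On loss of CCM Quorum: Ignore",
    "Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Promote drbd_r0:1#011(Slave -> Master hikari2)",
    "Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Restart virt#011(Started hikari)",
    "Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Restart shared-gfs2:0#011(Started hikari)",
    "Aug 8 20:00:57 hikari pengine[2733]: notice: LogActions: Start shared-gfs2:1#011(hikari2)",
]

if __name__ == "__main__":
    for name in restarted_resources(SAMPLE):
        print("scheduled restart:", name)
```

To run it against a real log, pass an open file object instead of SAMPLE, for example `restarted_resources(open(path))` with `path` pointing at wherever syslog writes on the host in question.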
