Hi all,

I've managed to make this setup work. The issue is that with symmetric-cluster="false" and the resources' locations specified manually, the resources always obey the location constraints and (as far as I could see) disregard the rsc_defaults resource-stickiness value. This is not the expected behavior: in theory, symmetric-cluster="false" should only affect whether resources are allowed to run anywhere by default, while resource-stickiness should lock resources in place so they don't bounce from node to node. That didn't happen here, but with symmetric-cluster="true", the same ordering and colocation constraints, and the same resource-stickiness, the behavior is the expected one.
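For reference, the working configuration boils down to something like the following crm shell sketch. The group name, node name, and scores here are illustrative, not copied from my CIB; the key point is keeping the location score below the accumulated stickiness so resources stay put after a failover:

```shell
# Sketch of the working setup (crm shell syntax; names/scores are examples).
crm configure property symmetric-cluster="true"
crm configure rsc_defaults resource-stickiness="1000"
# Prefer bench1, but with a score lower than the stickiness, so that
# resources do NOT migrate back when bench1 rejoins the cluster:
crm configure location loc_grp_all grp_all 500: bench1.streamwide.ro
```

Note that for a group the stickiness of the member resources accumulates, which makes the effective "stay here" score even larger relative to the location preference.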

I don't remember the docs on clusterlabs.org mentioning anywhere that resource-stickiness only works with symmetric-cluster="true", so I hope this helps anyone else who stumbles upon this issue.

Regards,

Dan

Dan Frincu wrote:
Hi,

Since it was brought to my attention that I should upgrade from openais-0.80 to a more recent version of corosync, I've done just that, however I'm experiencing a strange behavior on the cluster.

The same setup was used with the below packages:

# rpm -qa | grep -Ei "(openais|cluster|heartbeat|pacemaker|resource)"
openais-0.80.5-15.2
cluster-glue-1.0-12.2
pacemaker-1.0.5-4.2
cluster-glue-libs-1.0-12.2
resource-agents-1.0-31.5
pacemaker-libs-1.0.5-4.2
pacemaker-mgmt-1.99.2-7.2
libopenais2-0.80.5-15.2
heartbeat-3.0.0-33.3
pacemaker-mgmt-client-1.99.2-7.2

Now I've migrated to the most recent stable packages I could find (on the clusterlabs.org website) for RHEL5:

# rpm -qa | grep -Ei "(openais|cluster|heartbeat|pacemaker|resource)"
cluster-glue-1.0.6-1.6.el5
pacemaker-libs-1.0.9.1-1.el5
pacemaker-1.0.9.1-1.el5
heartbeat-libs-3.0.3-2.el5
heartbeat-3.0.3-2.el5
openaislib-1.1.3-1.6.el5
resource-agents-1.0.3-2.el5
cluster-glue-libs-1.0.6-1.6.el5
openais-1.1.3-1.6.el5

Expected behavior:
- all the resources in the group should go (based on location preference) to bench1
- if bench1 goes down, resources migrate to bench2
- if bench1 comes back up, resources stay on bench2, unless manually told otherwise.

With the previous packages this worked; with the new packages, not so much. Now if bench1 goes down (crm node standby `uname -n`), failover occurs, but when bench1 comes back up, the resources migrate back even though default-resource-stickiness is set. Worse, 2 drbd block devices reach an infinite fail-count, most notably because the cluster tries to promote the resources to a Master state on bench1 but fails to do so because the device is held open (by some process, I could not identify it).
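To track down which process is holding the device open, something along these lines might help (run on bench1; the device names are assumed from the mount list further below):

```shell
# Diagnostic sketch: find what is holding a DRBD device open.
# Device names are assumptions based on my mount output.
fuser -mv /dev/drbd0    # list PIDs using the device or its filesystem
lsof /dev/drbd0         # alternative view, if lsof is installed
cat /proc/drbd          # confirm connection state and current roles
```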

Strangely enough, the resources (drbd) fail to be promoted to a Master state on bench1, so they fail back to bench2, where they are mounted and functional, but crm_mon shows:

Migration summary:
* Node bench2.streamwide.ro:
  drbd_mysql:1: migration-threshold=1000000 fail-count=1000000
  drbd_home:1: migration-threshold=1000000 fail-count=1000000
* Node bench1.streamwide.ro:

.... an infinite fail-count on bench2, even though the drbd resources are available
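As far as I understand, once a resource hits fail-count=INFINITY on a node, Pacemaker will not retry it there until the fail-count is cleared manually. Assuming the resource names from the migration summary above, clearing them would look roughly like:

```shell
# Clear the accumulated fail-counts so the cluster will retry the
# resources (names taken from the migration summary above):
crm resource cleanup drbd_mysql
crm resource cleanup drbd_home
```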

version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by [email protected], 2009-08-29 14:07:55
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
ns:1632 nr:1864 dw:3512 dr:3933 al:11 bm:19 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
   ns:4 nr:24 dw:28 dr:25 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
   ns:4 nr:24 dw:28 dr:85 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

and mounted

/dev/drbd1 on /home type ext3 (rw,noatime,nodiratime)
/dev/drbd0 on /mysql type ext3 (rw,noatime,nodiratime)
/dev/drbd2 on /storage type ext3 (rw,noatime,nodiratime)

Attached is the hb_report.

Thank you in advance.

Best regards


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
