On Mon, Oct 11, 2010 at 9:40 AM, Dan Frincu <[email protected]> wrote:
> Hi all,
>
> I've managed to make this setup work. Basically, the issue is that with
> symmetric-cluster="false" and the resources' locations specified manually,
> the resources always obey the location constraints and (as far as I could
> see) disregard the rsc_defaults resource-stickiness values.
This definitely should not be the case. Possibly your stickiness setting is
being eclipsed by the combination of the location constraint scores. Try
INFINITY instead.

> This behavior is not the expected one. In theory, setting
> symmetric-cluster="false" should only affect whether resources are allowed
> to run anywhere by default, and resource-stickiness should lock the
> resources in place so they don't bounce from node to node. Again, this
> didn't happen, but with symmetric-cluster="true", the same ordering and
> colocation constraints, and the same resource-stickiness, the behavior is
> the expected one.
>
> I don't remember the docs on clusterlabs.org mentioning anywhere that
> resource-stickiness only works with symmetric-cluster="true", so I hope
> this helps anyone else who stumbles upon this issue.
>
> Regards,
>
> Dan
>
> Dan Frincu wrote:
>>
>> Hi,
>>
>> Since it was brought to my attention that I should upgrade from
>> openais-0.80 to a more recent version of corosync, I've done just that;
>> however, I'm now seeing strange behavior on the cluster.
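As a sketch of the suggestion above (the group and constraint names here are illustrative, not taken from the attached hb_report): in an asymmetric cluster every resource starts at -INFINITY on every node, so placement comes entirely from the explicit location scores, and a finite stickiness can be outweighed by a higher score for the preferred node — which looks exactly like stickiness being ignored. With INFINITY stickiness the running node can no longer be eclipsed:

```shell
# Opt-in cluster: resources may run only where a location constraint allows.
crm configure property symmetric-cluster="false"

# INFINITY stickiness: once placed, a resource's current node always wins
# over a merely-finite preference for another node.
crm configure rsc_defaults resource-stickiness="INFINITY"

# Allow the group on both nodes, preferring bench1. With stickiness=100
# instead of INFINITY, the 200-vs-100 gap below would pull the group back
# to bench1 on recovery; with INFINITY it stays put.
crm configure location loc-grp-bench1 my_group 200: bench1.streamwide.ro
crm configure location loc-grp-bench2 my_group 100: bench2.streamwide.ro
```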
>>
>> The same setup was used with the packages below:
>>
>> # rpm -qa | grep -Ei "(openais|cluster|heartbeat|pacemaker|resource)"
>> openais-0.80.5-15.2
>> cluster-glue-1.0-12.2
>> pacemaker-1.0.5-4.2
>> cluster-glue-libs-1.0-12.2
>> resource-agents-1.0-31.5
>> pacemaker-libs-1.0.5-4.2
>> pacemaker-mgmt-1.99.2-7.2
>> libopenais2-0.80.5-15.2
>> heartbeat-3.0.0-33.3
>> pacemaker-mgmt-client-1.99.2-7.2
>>
>> Now I've migrated to the most recent stable packages I could find (on
>> the clusterlabs.org website) for RHEL5:
>>
>> # rpm -qa | grep -Ei "(openais|cluster|heartbeat|pacemaker|resource)"
>> cluster-glue-1.0.6-1.6.el5
>> pacemaker-libs-1.0.9.1-1.el5
>> pacemaker-1.0.9.1-1.el5
>> heartbeat-libs-3.0.3-2.el5
>> heartbeat-3.0.3-2.el5
>> openaislib-1.1.3-1.6.el5
>> resource-agents-1.0.3-2.el5
>> cluster-glue-libs-1.0.6-1.6.el5
>> openais-1.1.3-1.6.el5
>>
>> Expected behavior:
>> - all the resources in the group should go (based on location
>>   preference) to bench1
>> - if bench1 goes down, resources migrate to bench2
>> - if bench1 comes back up, resources stay on bench2, unless manually
>>   told otherwise
>>
>> With the previous packages this worked; with the new packages, not so
>> much. Now if bench1 goes down (crm node standby `uname -n`), failover
>> occurs, but when bench1 comes back up, the resources migrate back even
>> though default-resource-stickiness is set. Worse, two DRBD block devices
>> hit INFINITY fail counts, most notably because the cluster tries to
>> promote the resources to Master on bench1 but fails, the resource being
>> held open (by some process I could not identify).
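To see why the resources fail back, it can help to dump the allocation scores the policy engine actually computes and compare the preferred node's constraint score against the running node's score plus stickiness. A sketch for the Pacemaker 1.0 series described above (newer releases ship the equivalent `crm_simulate` tool; the sample score lines are illustrative, not from this thread):

```shell
# Show allocation scores from the live CIB (-L = live cluster, -s = scores).
ptest -sL

# Expect lines of roughly this shape; if bench1's score exceeds
# bench2's score + stickiness for a resource, it fails back when
# bench1 returns from standby:
#   native_color: drbd_mysql:0 allocation score on bench1.streamwide.ro: 200
#   native_color: drbd_mysql:0 allocation score on bench2.streamwide.ro: 100
```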
>>
>> Strangely enough, the resources (drbd) fail to be promoted to Master on
>> bench1, so they fail back to bench2, where they are mounted and
>> functional, but crm_mon shows:
>>
>> Migration summary:
>> * Node bench2.streamwide.ro:
>>    drbd_mysql:1: migration-threshold=1000000 fail-count=1000000
>>    drbd_home:1: migration-threshold=1000000 fail-count=1000000
>> * Node bench1.streamwide.ro:
>>
>> ... i.e. INFINITY fail counts on bench2, while the drbd resources are
>> available:
>>
>> version: 8.3.2 (api:88/proto:86-90)
>> GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by
>> [email protected], 2009-08-29 14:07:55
>>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>>     ns:1632 nr:1864 dw:3512 dr:3933 al:11 bm:19 lo:0 pe:0 ua:0 ap:0
>>     ep:1 wo:b oos:0
>>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>>     ns:4 nr:24 dw:28 dr:25 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
>>  2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>>     ns:4 nr:24 dw:28 dr:85 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
>>
>> and mounted:
>>
>> /dev/drbd1 on /home type ext3 (rw,noatime,nodiratime)
>> /dev/drbd0 on /mysql type ext3 (rw,noatime,nodiratime)
>> /dev/drbd2 on /storage type ext3 (rw,noatime,nodiratime)
>>
>> Attached is the hb_report.
>>
>> Thank you in advance.
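For the two practical problems above — identifying the process holding the DRBD device open, and recovering from the fail-count=1000000 (i.e. INFINITY) entries — a sketch along these lines should help (run the first part on bench1; resource names are taken from the crm_mon output quoted above, but verify against your own configuration):

```shell
# Name the process keeping the DRBD backing device busy, which is what
# blocks promotion to Master on bench1; either tool should identify it:
fuser -vm /dev/drbd0
lsof /dev/drbd0

# Once the cause is resolved, clear the failures so Pacemaker will
# consider the node again; cleanup also resets the fail counts:
crm resource cleanup drbd_mysql
crm resource cleanup drbd_home
# or, per resource and node, via the crm shell:
crm resource failcount drbd_mysql delete bench1.streamwide.ro
```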
>>
>> Best regards
>>
>
> --
> Dan FRINCU
> Systems Engineer
> CCNA, RHCE
> Streamwide Romania
>
>
> _______________________________________________
> Pacemaker mailing list: [email protected]
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
