Re: [ClusterLabs] How can I prevent multiple start of IPaddr2 in an environment using fence_mpath?
Hi Ken,

Thanks for your comment. I agree that network fencing is a valid approach, but it relies heavily on the equipment: since we do not have an SNMP-capable network switch in our environment, we cannot try it right away. (A hypothetical configuration sketch follows the quoted message below.)

Thanks,
Yusuke

> -----Original Message-----
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken Gaillot
> Sent: Friday, April 06, 2018 11:12 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] How can I prevent multiple start of IPaddr2 in an
> environment using fence_mpath?
>
> On Fri, 2018-04-06 at 04:30 +0000, 飯田 雄介 wrote:
> > Hi, all
> > I am testing an environment that uses fence_mpath with the following
> > settings.
> >
> > ===
> > Stack: corosync
> > Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
> > Last updated: Fri Apr  6 13:16:20 2018
> > Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Online: [ x3650e x3650f ]
> >
> > Full list of resources:
> >
> >  fenceMpath-x3650e  (stonith:fence_mpath):  Started x3650e
> >  fenceMpath-x3650f  (stonith:fence_mpath):  Started x3650f
> >  Resource Group: grpPostgreSQLDB
> >      prmFsPostgreSQLDB1  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB2  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB3  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmApPostgreSQLDB   (ocf::heartbeat:pgsql):       Started x3650e
> >  Resource Group: grpPostgreSQLIP
> >      prmIpPostgreSQLDB   (ocf::heartbeat:IPaddr2):     Started x3650e
> >  Clone Set: clnDiskd1 [prmDiskd1]
> >      Started: [ x3650e x3650f ]
> >  Clone Set: clnDiskd2 [prmDiskd2]
> >      Started: [ x3650e x3650f ]
> >  Clone Set: clnPing [prmPing]
> >      Started: [ x3650e x3650f ]
> > ===
> >
> > When split-brain occurs in this environment, x3650f executes fencing and
> > the resources are started on x3650f.
> >
> > === view of x3650e
> > Stack: corosync
> > Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> > Last updated: Fri Apr  6 13:16:36 2018
> > Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Node x3650f: UNCLEAN (offline)
> > Online: [ x3650e ]
> >
> > Full list of resources:
> >
> >  fenceMpath-x3650e  (stonith:fence_mpath):  Started x3650e
> >  fenceMpath-x3650f  (stonith:fence_mpath):  Started [ x3650e x3650f ]
> >  Resource Group: grpPostgreSQLDB
> >      prmFsPostgreSQLDB1  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB2  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB3  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmApPostgreSQLDB   (ocf::heartbeat:pgsql):       Started x3650e
> >  Resource Group: grpPostgreSQLIP
> >      prmIpPostgreSQLDB   (ocf::heartbeat:IPaddr2):     Started x3650e
> >  Clone Set: clnDiskd1 [prmDiskd1]
> >      prmDiskd1  (ocf::pacemaker:diskd):  Started x3650f (UNCLEAN)
> >      Started: [ x3650e ]
> >  Clone Set: clnDiskd2 [prmDiskd2]
> >      prmDiskd2  (ocf::pacemaker:diskd):  Started x3650f (UNCLEAN)
> >      Started: [ x3650e ]
> >  Clone Set: clnPing [prmPing]
> >      prmPing  (ocf::pacemaker:ping):  Started x3650f (UNCLEAN)
> >      Started: [ x3650e ]
> >
> > === view of x3650f
> > Stack: corosync
> > Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> > Last updated: Fri Apr  6 13:16:36 2018
> > Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Online: [ x3650f ]
> > OFFLINE: [ x3650e ]
> >
> > Full list of resources:
> >
> >  fenceMpath-x3650e  (stonith:fence_mpath):  Started x3650f
> >  fenceMpath-x3650f  (stonith:fence_mpath):  Started x3650f
> >  Resource Group: grpPostgreSQLDB
> >      prmFsPostgreSQLDB1  (ocf::heartbeat:Filesystem):  Started x3650f
> >      prmFsPostgreSQLDB2  (ocf::heartbeat:Filesystem):  Started x3650f
> >      prmFsPostgreSQLDB3  (ocf::heartbeat:Filesystem):  Started x3650f
> >      prmApPostgreSQLDB   (ocf::heartbeat:pgsql):       Started x3650f
> >  Resource Group: grpPostgreSQLIP
> >      prmIpPostgreSQLDB   (ocf::heartbeat:IPaddr2):     Started x3650f
> >  Clone Set: clnDiskd1 [prmDiskd1]
> >      Started: [ x3650f ]
> >      Stopped: [ x3650e ]
> >  Clone Set: clnDiskd2 [prmDiskd2]
> >      Started: [ x3650f ]
> >      Stopped: [ x3650e ]
> >  Clone Set: clnPing [prmPing]
> >      Started: [ x3650f ]
> >      Stopped: [ x3650e ]
> >
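For reference, the network (fabric) fencing suggested above is typically done with an SNMP-manageable switch and an agent such as fence_ifmib. The following is only a hypothetical sketch: the switch address, SNMP community, and port name are made up, and option names differ between fence-agents releases, so check the agent metadata first.

    # Hypothetical sketch: fence x3650e by disabling its switch port via SNMP
    # (IF-MIB). Verify the exact option names with: fence_ifmib -o metadata
    pcs stonith create fenceSwitch-x3650e fence_ifmib \
        ipaddr=192.0.2.10 community=private port=Gi1/0/11 \
        pcmk_host_list=x3650e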
[ClusterLabs] Re: No slave is promoted to be master
Thank you very much, Rorthais, I see now.

I have two more questions:

1. If I change the "cluster-recheck-interval" parameter from the default 15
   minutes to 10 seconds, is there any bad impact? Could this be a workaround?

2. This issue happens only in the following configuration:

   [inline configuration diagram not included in the archive]

   But it does not happen in the following configuration. Why is the behavior
   different?

   [inline configuration diagram not included in the archive]

> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: 2018-04-17 17:47
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] No slave is promoted to be master
>
> On Tue, 17 Apr 2018 04:16:38 +0000 范国腾 <fanguot...@highgo.com> wrote:
>
> > I checked the status again. It is not that it is never promoted - it is
> > promoted about 15 minutes after the cluster starts.
> >
> > I tried in three labs and the results are the same: the promotion happens
> > 15 minutes after the cluster starts.
> >
> > Why is there an approximately 15-minute delay every time?
>
> This was a bug in Pacemaker up to 1.1.17. I reported it last August and Ken
> Gaillot fixed it a few days later in 1.1.18. See:
>
> https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
> https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html
>
> I wonder if disabling the pgsql resource before shutting down the cluster
> might be a simpler and safer workaround. E.g.:
>
>   pcs resource disable pgsql-ha --wait
>   pcs cluster stop --all
>
> and
>
>   pcs cluster start --all
>   pcs resource enable pgsql-ha
>
> Another fix would be to force a master score on one node **if needed**
> using:
>
>   crm_master -N <node> -r <resource> -l forever -v 1
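As a side note on question 1 (not from the thread itself): cluster-recheck-interval is a standard cluster property and can be changed with pcs. A very short interval forces the scheduler to re-run constantly, so a value of a minute or two is usually a safer compromise than 10 seconds; the value below is only an assumed example.

    # Sketch: shorten the recheck interval so timer-driven re-evaluation
    # happens sooner than the 15-minute default (60s is an arbitrary example)
    pcs property set cluster-recheck-interval=60s
    # list the configured properties to confirm the change
    pcs property list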
Re: [ClusterLabs] Booth fail-over conditions
Hi,

On Mon, Apr 16, 2018 at 01:22:08PM +0200, Kristoffer Grönlund wrote:
> Zach Anderson writes:
>
> > Hey all,
> >
> > New user to pacemaker/booth and I'm fumbling my way through my first proof
> > of concept. I have a 2-site configuration with local pacemaker clusters at
> > each site (running rabbitmq) and a booth arbitrator. I've successfully
> > validated the base failover when the "granted" site has failed. My question
> > is whether there are any other ways to configure failover, e.g. using
> > resource health checks or the like?

You can take a look at "before-acquire-handler" (quite a mouthful there). The
main motivation was to add the ability to verify that some other conditions at
_the site_ are good, perhaps using environment sensors, say to measure
temperature, or whether the air conditioning works, or such. Nothing stops you
from doing a resource health check there, but it could probably be deemed as
something on a rather different "level".

> Hi Zach,
>
> Do you mean that a resource health check should trigger site failover?
> That's actually something I'm not sure comes built-in..

There's nothing really specific about a resource, because booth knows nothing
about resources. The tickets are the only way it can describe the world ;-)

Cheers,

Dejan

> though making a resource agent which revokes a ticket on failure should be
> fairly straight-forward. You could then group your resource with the ticket
> resource to enable this functionality.
>
> The logic in the ticket resource ought to be something like "if monitor
> fails and the current site is granted, then revoke the ticket, else do
> nothing". You would probably want to handle probe monitor invocations
> differently. There is an ocf_is_probe function provided to help with this.
>
> Cheers,
> Kristoffer
>
> > Thanks!
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
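To make the before-acquire-handler suggestion above concrete, here is a minimal sketch of a booth.conf. The addresses, ticket name, and handler script path are all assumptions; the handler is simply expected to exit non-zero when the local site should not hold the ticket.

    # /etc/booth/booth.conf - hedged sketch, placeholder addresses and names
    transport = UDP
    port = 9929
    arbitrator = 192.0.2.100
    site = 192.0.2.101
    site = 192.0.2.102
    ticket = "ticket-rabbitmq"
    # run before acquiring/renewing the ticket; a non-zero exit refuses it
    before-acquire-handler = /usr/local/bin/site-health-check.sh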
Re: [ClusterLabs] How can I prevent multiple start of IPaddr2 in an environment using fence_mpath?
Hi Andrei,

Thanks for your comment.

We are not assuming node-level fencing in the current environment.

I tried the power_timeout setting that you suggested. However, fence_mpath immediately reports the status as "off" when the off action is executed:

https://github.com/ClusterLabs/fence-agents/blob/v4.0.25/fence/agents/lib/fencing.py.py#L744

So we could not use this option to wait for IPaddr2 to stop.

Reading the code, I found the power_wait option. With it, completion of the STONITH action can be delayed by a specified amount of time, so it seems to meet our requirements (a command sketch follows the quoted message below).

Thanks,
Yusuke

> -----Original Message-----
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Andrei Borzenkov
> Sent: Friday, April 06, 2018 2:04 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] How can I prevent multiple start of IPaddr2 in an
> environment using fence_mpath?
>
> 06.04.2018 07:30, 飯田 雄介 wrote:
> > Hi, all
> > I am testing an environment that uses fence_mpath with the following
> > settings.
> >
> > ===
> > Stack: corosync
> > Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
> > Last updated: Fri Apr  6 13:16:20 2018
> > Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Online: [ x3650e x3650f ]
> >
> > Full list of resources:
> >
> >  fenceMpath-x3650e  (stonith:fence_mpath):  Started x3650e
> >  fenceMpath-x3650f  (stonith:fence_mpath):  Started x3650f
> >  Resource Group: grpPostgreSQLDB
> >      prmFsPostgreSQLDB1  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB2  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB3  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmApPostgreSQLDB   (ocf::heartbeat:pgsql):       Started x3650e
> >  Resource Group: grpPostgreSQLIP
> >      prmIpPostgreSQLDB   (ocf::heartbeat:IPaddr2):     Started x3650e
> >  Clone Set: clnDiskd1 [prmDiskd1]
> >      Started: [ x3650e x3650f ]
> >  Clone Set: clnDiskd2 [prmDiskd2]
> >      Started: [ x3650e x3650f ]
> >  Clone Set: clnPing [prmPing]
> >      Started: [ x3650e x3650f ]
> > ===
> >
> > When split-brain occurs in this environment, x3650f executes fencing and
> > the resources are started on x3650f.
> >
> > === view of x3650e
> > Stack: corosync
> > Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> > Last updated: Fri Apr  6 13:16:36 2018
> > Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Node x3650f: UNCLEAN (offline)
> > Online: [ x3650e ]
> >
> > Full list of resources:
> >
> >  fenceMpath-x3650e  (stonith:fence_mpath):  Started x3650e
> >  fenceMpath-x3650f  (stonith:fence_mpath):  Started [ x3650e x3650f ]
> >  Resource Group: grpPostgreSQLDB
> >      prmFsPostgreSQLDB1  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB2  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmFsPostgreSQLDB3  (ocf::heartbeat:Filesystem):  Started x3650e
> >      prmApPostgreSQLDB   (ocf::heartbeat:pgsql):       Started x3650e
> >  Resource Group: grpPostgreSQLIP
> >      prmIpPostgreSQLDB   (ocf::heartbeat:IPaddr2):     Started x3650e
> >  Clone Set: clnDiskd1 [prmDiskd1]
> >      prmDiskd1  (ocf::pacemaker:diskd):  Started x3650f (UNCLEAN)
> >      Started: [ x3650e ]
> >  Clone Set: clnDiskd2 [prmDiskd2]
> >      prmDiskd2  (ocf::pacemaker:diskd):  Started x3650f (UNCLEAN)
> >      Started: [ x3650e ]
> >  Clone Set: clnPing [prmPing]
> >      prmPing  (ocf::pacemaker:ping):  Started x3650f (UNCLEAN)
> >      Started: [ x3650e ]
> >
> > === view of x3650f
> > Stack: corosync
> > Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> > Last updated: Fri Apr  6 13:16:36 2018
> > Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Online: [ x3650f ]
> > OFFLINE: [ x3650e ]
> >
> > Full list of resources:
> >
> >  fenceMpath-x3650e  (stonith:fence_mpath):  Started x3650f
> >  fenceMpath-x3650f  (stonith:fence_mpath):  Started x3650f
> >  Resource Group: grpPostgreSQLDB
> >      prmFsPostgreSQLDB1  (ocf::heartbeat:Filesystem):  Started x3650f
> >      prmFsPostgreSQLDB2  (ocf::heartbeat:Filesystem):  Started x3650f
> >      prmFsPostgreSQLDB3  (ocf::heartbeat:Filesystem):  Started x3650f
> >      prmApPostgreSQLDB   (ocf::heartbeat:pgsql):       Started x3650f
> >  Resource Group: grpPostgreSQLIP
> >      prmIpPostgreSQLDB   (ocf::heartbeat:IPaddr2):     Started x3650f
> >  Clone Set: clnDiskd1 [prmDiskd1]
> >      Started: [ x3650f ]
> >      Stopped: [ x3650e ]
> >  Clone Set: clnDiskd2 [prmDiskd2]
> >      Started: [ x3650f ]
> >      Stopped: [ x3650e ]
> >  Clone Set: clnPing [prmPing]
> >      Started: [ x3650f ]
> >      Stopped: [ x3650e ]
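As an illustration of the power_wait approach described above (a sketch only, not from the original thread; the 10-second value is an assumption to be tuned against the stop timeouts of your resources):

    # Sketch: delay completion of the fencing action so the victim's resources
    # (e.g. IPaddr2) have time to stop before the survivor starts them;
    # 10 seconds is an assumed value, adjust to your resource stop timeouts
    pcs stonith update fenceMpath-x3650e power_wait=10
    pcs stonith update fenceMpath-x3650f power_wait=10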
Re: [ClusterLabs] No slave is promoted to be master
On Tue, 17 Apr 2018 04:16:38 +0000 范国腾 wrote:

> I checked the status again. It is not that it is never promoted - it is
> promoted about 15 minutes after the cluster starts.
>
> I tried in three labs and the results are the same: the promotion happens
> 15 minutes after the cluster starts.
>
> Why is there an approximately 15-minute delay every time?

This was a bug in Pacemaker up to 1.1.17. I reported it last August and Ken
Gaillot fixed it a few days later in 1.1.18. See:

https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html

I wonder if disabling the pgsql resource before shutting down the cluster
might be a simpler and safer workaround. E.g.:

  pcs resource disable pgsql-ha --wait
  pcs cluster stop --all

and

  pcs cluster start --all
  pcs resource enable pgsql-ha

Another fix would be to force a master score on one node **if needed** using:

  crm_master -N <node> -r <resource> -l forever -v 1
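A small hedged addition for verifying either workaround: crm_mon can display node attributes, including the master/promotion scores set for the pgsql resource, so you can confirm a score exists right after restart instead of waiting for the recheck interval. The resource name pgsqld below is an assumption; substitute your own.

    # Print cluster status once (-1), including node attributes (-A), which
    # show the master-* promotion scores if the agent has set them
    crm_mon -1 -A

    # Or query the score directly on one node; <node> and pgsqld are placeholders
    crm_master -G -N <node> -r pgsqld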