Sorry for the double post. I find that pinging the network gateway (192.168.1.1) actually works better. Otherwise the nodes end up with equal pingd scores, since the pingd resource is cloned (each node can still reach its own address, so a one-sided network failure looks the same from both sides).
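Concretely, the only change is the host_list parameter on the pingd primitive from the configuration quoted below, assuming 192.168.1.1 is a gateway reachable from both nodes:

primitive pingd ocf:pacemaker:pingd \
        params host_list="192.168.1.1" multiplier="100" \
        op monitor interval="15s" timeout="5s"

Because pingd is cloned, every node then pings the same external target, and a node's score drops only when that node itself loses connectivity.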
On May 18, 2011, at 2:45 PM, Daniel Bozeman wrote:

> Here is my solution for others to reference. It may not be ideal or possible for everyone, and I am open to suggestions.
>
> I've got two machines connected via crossover cable (there will be two crossovers for redundancy in production) with static IPs; Corosync communicates over this network. Each machine is also connected to the main network (.1.77 and .1.78).
>
> This way, the machines can continue to communicate with one another despite a network failure affecting one of them, and can react appropriately.
>
> Using Postgres as a test resource, I have the following (desired) results: the primary node loses network connectivity and Postgres is fired up on the other node. When the former primary regains connectivity, the process neither fails back nor restarts.
>
> Please see my configuration below:
>
> node postmaster
> node postslave
> primitive pingd ocf:pacemaker:pingd \
>         params host_list="192.168.1.77 192.168.1.78" multiplier="100" \
>         op monitor interval="15s" timeout="5s"
> primitive postgres lsb:postgresql \
>         op monitor interval="20s"
> clone pingdclone pingd \
>         meta globally-unique="false"
> location postgres_location postgres \
>         rule $id="postgres_location-rule" pingd: defined pingd
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1305736421"
>
> Naturally, this is a very simple configuration that only tests failover on network failure and prevention of failback.
>
> Are there any downsides to my method? I'd love to hear feedback. Thank you all for your help. "on-fail=standby" did absolutely nothing for me, by the way.
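A note on the location constraint above: "rule pingd: defined pingd" only expresses a positive preference for connected nodes (the pingd attribute value, i.e. multiplier x reachable hosts); it never forbids a node outright. The stricter variant from the Pacemaker documentation, sketched here with an illustrative constraint name, keeps the resource off any node whose pingd attribute is missing or zero:

location postgres-connected postgres \
        rule -inf: not_defined pingd or pingd lte 0

Since pingd updates the attribute automatically once connectivity returns, the -INF ban lifts on its own; no manual "score reset" is needed, and stickiness then decides whether anything moves back.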
> On May 18, 2011, at 9:16 AM, Daniel Bozeman wrote:
>
>> I was originally using Heartbeat with my original config, as I mentioned in my first post, but for troubleshooting I moved on to a config identical to the one in the documentation.
>>
>> Why is "on-fail=standby" not optimal? I have tried it in the past, but it did not help. As far as I can tell, Pacemaker does not consider a loss of network connectivity to be a failure of the server itself or of any of its resources. As I've said, everything works fine if I kill a process, kill Corosync, etc.
>>
>> I think this may be what I am looking for:
>>
>> http://www.clusterlabs.org/wiki/Example_configurations#Set_up_pingd
>>
>> But I am still having issues. How can I reset the scores once the node has recovered? Is there some sort of "score reset" command? Once the node is set to -INF, as that example shows, nothing is going to return to it.
>>
>> Thank you all for your help.
>>
>> On May 18, 2011, at 4:02 AM, Dan Frincu wrote:
>>
>>> Hi,
>>>
>>> On Wed, May 18, 2011 at 11:30 AM, Max Williams <max.willi...@betfair.com> wrote:
>>>
>>> Hi Daniel,
>>>
>>> You might want to set "on-fail=standby" for the resource group or for individual resources. This will put the host into standby when a failure occurs, thus preventing failback:
>>>
>>> This is not the most optimal solution.
>>>
>>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html#s-resource-failure
>>>
>>> Another option is to set resource stickiness, which will stop resources from moving back after a failure:
>>>
>>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html
>>>
>>> That is set globally in his config.
>>>
>>> Also note that if you are using a two-node cluster, you will also need the property "no-quorum-policy=ignore" set.
>>>
>>> This as well.
>>>
>>> Hope that helps!
>>>
>>> Cheers,
>>> Max
>>>
>>> From: Daniel Bozeman [mailto:daniel.boze...@americanroamer.com]
>>> Sent: 17 May 2011 19:09
>>> To: pacemaker@oss.clusterlabs.org
>>> Subject: Re: [Pacemaker] Preventing auto-fail-back
>>>
>>> To be more specific:
>>>
>>> I've tried following the example on pages 25-26 of this document to the letter: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>
>>> Well, not really, that's why there are errors in your config.
>>>
>>> And it does work as advertised. When I stop Corosync, the resource moves to the other node; when I start Corosync again, it remains there, as it should.
>>>
>>> However, if I simply unplug the Ethernet connection, let the resource migrate, then plug it back in, it fails back to the original node. Is this the intended behavior? It seems a bad NIC could wreak havoc on such a setup.
>>>
>>> Thanks!
>>>
>>> Daniel
>>>
>>> On May 16, 2011, at 5:33 PM, Daniel Bozeman wrote:
>>>
>>> For the life of me, I cannot prevent auto-failback from occurring in a master/slave setup I have in virtual machines. I have a very simple configuration:
>>>
>>> node $id="4fe75075-333c-4614-8a8a-87149c7c9fbb" ha2 \
>>>         attributes standby="off"
>>> node $id="70718968-41b5-4aee-ace1-431b5b65fd52" ha1 \
>>>         attributes standby="off"
>>> primitive FAILOVER-IP ocf:heartbeat:IPaddr \
>>>         params ip="192.168.1.79" \
>>>         op monitor interval="10s"
>>> primitive PGPOOL lsb:pgpool2 \
>>>         op monitor interval="10s"
>>> group PGPOOL-AND-IP FAILOVER-IP PGPOOL
>>> colocation IP-WITH-PGPOOL inf: FAILOVER-IP PGPOOL
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
>>>
>>> Change to cluster-infrastructure="openais".
>>>
>>>         cluster-infrastructure="Heartbeat" \
>>>         stonith-enabled="false" \
>>>         no-quorum-policy="ignore"
>>>
>>> You're missing expected-quorum-votes here; it should be expected-quorum-votes="2". It is usually added automatically when the nodes are added to (and seen by) the cluster, so I assume its absence is related to cluster-infrastructure="Heartbeat".
>>>
>>> Regards,
>>> Dan
>>>
>>> rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="1000"
>>>
>>> No matter what I do with resource stickiness, I cannot prevent failback. I usually don't have a problem with failback when I restart the current master; but when I disable network connectivity to the master, everything fails over fine, and then when I re-enable the network adapter everything jumps back to the original "failed" node. I've done some "watch ptest -Ls"-ing, and the scores seem to indicate that failback should not occur. I'm also seeing resources bounce more times than necessary when a node is added (~3 times each), and resources seem to bounce when a node returns to the cluster even when it isn't necessary for them to do so.
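Combining Dan's two corrections, the property section of that original config would presumably end up looking like the sketch below (expected-quorum-votes is normally maintained by the cluster itself, and "openais" applies once the stack runs on Corosync/openais, as in the newer configuration above):

property $id="cib-bootstrap-options" \
        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"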
>>> I also had an order directive in my configuration at one time, and often the second resource would start, then stop, then allow the first resource to start, then start itself. Quite weird. Any nudge in the right direction would be greatly appreciated; I've scoured Google and read the official documentation to no avail. I should also mention that I am using Heartbeat, and that my LSB resource implements start/stop/status properly, without error.
>>>
>>> I've been testing this with a floating IP + Postgres as well, with the same issues. One thing I notice is that my "group" resources have no score. Is this normal? There doesn't seem to be any way to assign a stickiness to a group, and default stickiness has no effect.
>>>
>>> Thanks!
>>>
>>> Daniel Bozeman
>>>
>>> --
>>> Dan Frincu
>>> CCNA, RHCE
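As a footnote to the group-stickiness question quoted above: the crm shell does accept meta attributes on a group definition, so one candidate (which I have not verified in this exact setup) would be:

group PGPOOL-AND-IP FAILOVER-IP PGPOOL \
        meta resource-stickiness="1000"

Whether that changes the group's placement relative to the rsc_defaults stickiness is something "ptest -Ls" should show.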
Daniel Bozeman
American Roamer
Systems Administrator
daniel.boze...@americanroamer.com

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker