On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <and...@beekhof.net> wrote:

> 
> On 8 Nov 2013, at 12:59 pm, Sean Lutner <s...@rentul.net> wrote:
> 
>> 
>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> 
>>> 
>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>> 
>>>> I have a confusing situation that I'm hoping to get help with. Last night, 
>>>> after configuring STONITH on my two-node cluster, a "ghost" node suddenly 
>>>> appeared in my cluster. I'm looking to understand the best way to remove 
>>>> this node from the config.
>>>> 
>>>> I'm using the fence_ec2 agent for STONITH. I dropped the script on 
>>>> each node, registered the device with stonith_admin -R -a fence_ec2 and 
>>>> confirmed the registration with both
>>>> 
>>>> # stonith_admin -I
>>>> # pcs stonith list
>>>> 
>>>> I then configured STONITH per the Clusters from Scratch doc
>>>> 
>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>> 
>>>> Here are my commands:
>>>> # pcs cluster cib stonith_cfg
>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 
>>>> ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" 
>>>> pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" 
>>>> timeout="150s" op start start-delay="30s" interval="0"
>>>> # pcs -f stonith_cfg stonith
>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>> # pcs -f stonith_cfg property
>>>> # pcs cluster push cib stonith_cfg
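>>>> 
>>>> As a quick sanity check (assuming the push succeeded), the same two 
>>>> queries run without -f should now show the device and the property in 
>>>> the live CIB:
>>>> 
>>>> # pcs stonith
>>>> # pcs property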
>>>> 
>>>> After that I saw that STONITH appeared to be functioning, but a new node 
>>>> showed up in the pcs status output:
>>> 
>>> Do the EC2 instances have fixed IPs?
>>> I didn't have much luck with EC2 because every time the instances came back 
>>> up it was with a new name/address, which confused corosync and created 
>>> situations like this.
>> 
>> The IPs persist across reboots as far as I can tell. I thought the problem 
>> was due to stonith being enabled but not working, so I removed the stonith_id 
>> and disabled stonith. After that I restarted pacemaker and cman on both 
>> nodes and things started as expected, but the ghost node is still there. 
>> 
>> Someone else working on the cluster exported the CIB, removed the node and 
>> then imported the CIB. They used this process 
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>> 
>> Even after that, the ghost node is still there. Would exporting with pcs 
>> cluster cib > /tmp/cib-temp.xml, editing the node out of the config, and 
>> then pushing with pcs cluster push cib /tmp/cib-temp.xml work?
> 
> No. If it's coming back then pacemaker is holding it in one of its internal 
> caches.
> The only way to clear it out in your version is to restart pacemaker on the 
> DC.
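> 
> For example, since your pcs status output names ip-10-50-3-122 as the DC, 
> restarting the service there should do it:
> 
> # service pacemaker restart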
> 
> Actually... are you sure someone didn't just slip while editing cluster.conf? 
>  [...].1251 does not look like a valid IP :)

In the end, this fixed it:

# pcs cluster cib > /tmp/cib-tmp.xml
# vi /tmp/cib-tmp.xml # remove bad node
# pcs cluster push cib /tmp/cib-tmp.xml

Followed by restarting pacemaker and cman on both nodes. The ghost node 
disappeared, so it was cached as you mentioned.
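
In case it helps anyone searching later: once the caches were cleared, the 
membership could be double-checked with crm_node (which lists the nodes 
pacemaker knows about) alongside pcs status:

# crm_node -l
# pcs status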

I also tracked the bad IP down to stray non-printing characters in the initial 
command line used to configure the fence_ec2 stonith device. I'd put the 
command together from the GitHub README and some mailing list posts and laid it 
out in an external editor. Go me. :)
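
In case it saves someone else the same headache: staging the assembled command 
in a file and inspecting it with cat -A (or od -c) makes stray non-printing 
characters visible before they reach the CIB. For example, with the command 
saved to a scratch file (path here is made up):

# cat -A /tmp/stonith-cmd.sh   # tabs show as ^I, high-bit bytes as M-..., line ends as $
# sh /tmp/stonith-cmd.sh       # run it only once the output looks clean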

> 
> 
>>>> Version: 1.1.8-7.el6-394e906
> 
> There is now an update to 1.1.10 available for 6.4, which _may_ help in the 
> future.

That's my next task. I believe I'm hitting the bug where failure-timeout 
doesn't clear the failcount, and I want to upgrade to 1.1.10. Is it safe to 
yum update pacemaker after stopping the cluster? I see there is also an 
updated pcs in CentOS 6.4; should I update that as well?
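
For the record, the procedure I have in mind is roughly the following on each 
node (assuming a full-cluster outage window, and the stock CentOS 6.4 init 
scripts):

# service pacemaker stop
# service cman stop
# yum update pacemaker pcs
# service cman start
# service pacemaker start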

> 
>> 
>> I may have to go back to the drawing board on a fencing device for the 
>> nodes. Are there any other recommendations for a cluster on EC2 nodes?
>> 
>> Thanks very much
>> 
>>> 
>>>> 
>>>> # pcs status
>>>> Last updated: Thu Nov  7 17:41:21 2013
>>>> Last change: Thu Nov  7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>> Stack: cman
>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>> Version: 1.1.8-7.el6-394e906
>>>> 3 Nodes configured, unknown expected votes
>>>> 11 Resources configured.
>>>> 
>>>> 
>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>> 
>>>> Full list of resources:
>>>> 
>>>> ClusterEIP_54.215.143.166      (ocf::pacemaker:EIP):   Started 
>>>> ip-10-50-3-122
>>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>   Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>   Stopped: [ EIP-AND-VARNISH:2 ]
>>>> ec2-fencing    (stonith:fence_ec2):    Stopped 
>>>> 
>>>> I have no idea where the node marked UNCLEAN came from, though it's 
>>>> clearly a typo of a proper cluster node's name.
>>>> 
>>>> The only command I ran with the bad node ID was:
>>>> 
>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node 
>>>> ip-10-50-3-1251
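>>>> 
>>>> For comparison, the command I meant to run, with the real node name, was:
>>>> 
>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node 
>>>> ip-10-50-3-251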
>>>> 
>>>> Is there any possible way that could have caused the node to be added?
>>>> 
>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is 
>>>> no such node (and thus no pcsd to contact) that failed. Is there a way I can safely remove 
>>>> this ghost node from the cluster? I can provide logs from pacemaker or 
>>>> corosync as needed.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
