On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
> On 8 Nov 2013, at 12:59 pm, Sean Lutner <s...@rentul.net> wrote:
> 
>> 
>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> 
>>> 
>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>> 
>>>> I have a confusing situation that I'm hoping to get help with. Last night,
>>>> after configuring STONITH on my two-node cluster, I suddenly have a
>>>> "ghost" node in my cluster. I'm looking to understand the best way to
>>>> remove this node from the config.
>>>> 
>>>> I'm using the fence_ec2 device for STONITH. I dropped the script on
>>>> each node, registered the device with stonith_admin -R -a fence_ec2 and
>>>> confirmed the registration with both
>>>> 
>>>> # stonith_admin -I
>>>> # pcs stonith list
>>>> 
>>>> I then configured STONITH per the Clusters from Scratch doc
>>>> 
>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>> 
>>>> Here are my commands:
>>>> # pcs cluster cib stonith_cfg
>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2
>>>>     ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list"
>>>>     pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251"
>>>>     op monitor interval="300s" timeout="150s"
>>>>     op start start-delay="30s" interval="0"
>>>> # pcs -f stonith_cfg stonith
>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>> # pcs -f stonith_cfg property
>>>> # pcs cluster push cib stonith_cfg
>>>> 
>>>> After that I saw that STONITH appears to be functioning, but there is a
>>>> new node listed in the pcs status output:
>>> 
>>> Do the EC2 instances have fixed IPs?
>>> I didn't have much luck with EC2 because every time they came back up it
>>> was with a new name/address, which confused corosync and created
>>> situations like this.
>> 
>> The IPs persist across reboots as far as I can tell. I thought the problem
>> was due to stonith being enabled but not working, so I removed the
>> stonith_id and disabled stonith. After that I restarted pacemaker and cman
>> on both nodes and things started as expected, but the ghost node is still
>> there.
>> 
>> Someone else working on the cluster exported the CIB, removed the node and
>> then imported the CIB. They used this process:
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>> 
>> Even after that, the ghost node is still there. Would pcs cluster cib >
>> /tmp/cib-temp.xml and then pcs cluster push cib /tmp/cib-temp.xml, after
>> editing the node out of the config, work?
> 
> No. If it's coming back then pacemaker is holding it in one of its internal
> caches.
> The only way to clear it out in your version is to restart pacemaker on the
> DC.
> 
> Actually... are you sure someone didn't just slip while editing
> cluster.conf? [...].1251 does not look like a valid IP :)

In the end this fixed it:

# pcs cluster cib > /tmp/cib-tmp.xml
# vi /tmp/cib-tmp.xml      # remove the bad node
# pcs cluster push cib /tmp/cib-tmp.xml

followed by restarting pacemaker and cman on both nodes. The ghost node
disappeared, so it was cached as you mentioned.

I also tracked the bad IP down to stray non-printing characters in the
initial command line while configuring the fence_ec2 stonith device. I'd put
the command together from the github README and some mailing list posts and
laid it out in an external editor. Go me. :)

> 
>>>>> Version: 1.1.8-7.el6-394e906
> 
> There is now an update to 1.1.10 available for 6.4, that _may_ help in the
> future.

That's my next task.
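(For the archives: a quick way to catch stray non-printing characters like
that before running a long pasted command is to save the command to a file
first and check it with something like the following; the file name here is
just an example:

# cat -A /tmp/stonith-create.txt
# tr -cd '\11\12\15\40-\176' < /tmp/stonith-create.txt > /tmp/stonith-create.clean

GNU cat -A prints tabs, line ends and non-ASCII bytes as visible escapes
such as ^I, $ and M-..., and the tr invocation strips everything except tab,
newline, carriage return and printable ASCII.)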
I believe I'm hitting the failure-timeout not clearing failcount bug and want
to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the
cluster? I see there is also an updated pcs in CentOS 6.4; should I update
that as well? (The rough sequence I have in mind is at the bottom of this
mail.)

> 
>> 
>> I may have to go back to the drawing board on a fencing device for the
>> nodes. Are there any other recommendations for a cluster on EC2 nodes?
>> 
>> Thanks very much
>> 
>>> 
>>>> 
>>>> # pcs status
>>>> Last updated: Thu Nov 7 17:41:21 2013
>>>> Last change: Thu Nov 7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>> Stack: cman
>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>> Version: 1.1.8-7.el6-394e906
>>>> 3 Nodes configured, unknown expected votes
>>>> 11 Resources configured.
>>>> 
>>>> 
>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>> 
>>>> Full list of resources:
>>>> 
>>>>  ClusterEIP_54.215.143.166   (ocf::pacemaker:EIP):   Started ip-10-50-3-122
>>>>  Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>      Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>      Stopped: [ EIP-AND-VARNISH:2 ]
>>>>  ec2-fencing   (stonith:fence_ec2):    Stopped
>>>> 
>>>> I have no idea where the node that is marked UNCLEAN came from, though
>>>> it's clearly a typo of a proper cluster node name.
>>>> 
>>>> The only command I ran with the bad node ID was:
>>>> 
>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node
>>>>   ip-10-50-3-1251
>>>> 
>>>> Is there any possible way that could have caused the node to be added?
>>>> 
>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there
>>>> is no such node and thus no pcsd, that failed. Is there a way I can
>>>> safely remove this ghost node from the cluster? I can provide logs from
>>>> pacemaker or corosync as needed.
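For reference, the rough upgrade sequence I have in mind on each node,
assuming a full outage window rather than a rolling upgrade:

# service pacemaker stop
# service cman stop
# yum update pacemaker pcs
# service cman start
# service pacemaker start

Does that look sane?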
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org