> On Nov 14, 2013, at 6:47 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> On 15 Nov 2013, at 10:24 am, Sean Lutner <s...@rentul.net> wrote:
>>
>>> On Nov 14, 2013, at 6:14 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>
>>>> On 14 Nov 2013, at 2:55 pm, Sean Lutner <s...@rentul.net> wrote:
>>>>
>>>>> On Nov 13, 2013, at 10:51 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>
>>>>>> On 14 Nov 2013, at 1:12 pm, Sean Lutner <s...@rentul.net> wrote:
>>>>>>
>>>>>>> On Nov 10, 2013, at 8:03 PM, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>
>>>>>>>> On Nov 10, 2013, at 7:54 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>
>>>>>>>>> On 11 Nov 2013, at 11:44 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>
>>>>>>>>>> On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 8 Nov 2013, at 12:59 pm, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a confusing situation that I'm hoping to get help with. Last night, after configuring STONITH on my two-node cluster, I suddenly have a "ghost" node in my cluster. I'm looking to understand the best way to remove this node from the config.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm using the fence_ec2 device for STONITH.
>>>>>>>>>>>>> I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2, and confirmed the registration with both:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # stonith_admin -I
>>>>>>>>>>>>> # pcs stonith list
>>>>>>>>>>>>>
>>>>>>>>>>>>> I then configured STONITH per the Clusters from Scratch doc:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are my commands:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # pcs cluster cib stonith_cfg
>>>>>>>>>>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>>>>>>>>>>>> # pcs -f stonith_cfg stonith
>>>>>>>>>>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>>>>>>>>>>> # pcs -f stonith_cfg property
>>>>>>>>>>>>> # pcs cluster push cib stonith_cfg
>>>>>>>>>>>>>
>>>>>>>>>>>>> After that I saw that STONITH appears to be functioning, but a new node is listed in pcs status output:
>>>>>>>>>>>>
>>>>>>>>>>>> Do the EC2 instances have fixed IPs?
>>>>>>>>>>>> I didn't have much luck with EC2 because every time the instances came back up it was with a new name/address, which confused corosync and created situations like this.
>>>>>>>>>>>
>>>>>>>>>>> The IPs persist across reboots as far as I can tell. I thought the problem was due to stonith being enabled but not working, so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
>>>>>>>>>>>
>>>>>>>>>>> Someone else working on the cluster exported the CIB, removed the node, and then imported the CIB.
>>>>>>>>>>> They used this process:
>>>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>>>>>>>>>>>
>>>>>>>>>>> Even after that, the ghost node is still there. Would pcs cluster cib > /tmp/cib-temp.xml, followed by pcs cluster push cib /tmp/cib-temp.xml after editing the node out of the config, work?
>>>>>>>>>>
>>>>>>>>>> No. If it's coming back, then pacemaker is holding it in one of its internal caches.
>>>>>>>>>> The only way to clear it out in your version is to restart pacemaker on the DC.
>>>>>>>>>>
>>>>>>>>>> Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :)
>>>>>>>>>
>>>>>>>>> In the end this fixed it:
>>>>>>>>>
>>>>>>>>> # pcs cluster cib > /tmp/cib-tmp.xml
>>>>>>>>> # vi /tmp/cib-tmp.xml   # remove bad node
>>>>>>>>> # pcs cluster push cib /tmp/cib-tmp.xml
>>>>>>>>>
>>>>>>>>> followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned.
>>>>>>>>>
>>>>>>>>> I also tracked the bad IP down to non-printing characters in the initial command line while configuring the fence_ec2 stonith device. I'd put the command together from the github README and some mailing list posts and laid it out in an external editor. Go me. :)
>>>>>>>>>
>>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>
>>>>>>>>>> There is now an update to 1.1.10 available for 6.4, that _may_ help in the future.
>>>>>>>>>
>>>>>>>>> That's my next task. I believe I'm hitting the failure-timeout-not-clearing-failcount bug and want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4; should I update that as well?
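For reference, the manual vi step in the fix above can also be scripted. A minimal sketch of the same edit; the tiny <nodes> section below is a hypothetical stand-in for the exported CIB (the real one would come from pcs cluster cib), and it assumes the ghost entry fits on one <node .../> line, as cibadmin typically emits it:

```shell
# Hypothetical stand-in for `pcs cluster cib > /tmp/cib-tmp.xml`:
cat > /tmp/cib-tmp.xml <<'EOF'
<nodes>
  <node id="ip-10-50-3-122" uname="ip-10-50-3-122"/>
  <node id="ip-10-50-3-251" uname="ip-10-50-3-251"/>
  <node id="ip-10-50-3-1251" uname="ip-10-50-3-1251"/>
</nodes>
EOF

# The vi edit boils down to deleting the ghost node's line:
sed -i '/uname="ip-10-50-3-1251"/d' /tmp/cib-tmp.xml

# Two real nodes remain; on a live cluster the result would then be
# pushed back with `pcs cluster push cib /tmp/cib-tmp.xml`.
grep -c '<node ' /tmp/cib-tmp.xml   # prints 2
```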
>>>>>>>> yes and yes
>>>>>>>>
>>>>>>>> You might want to check whether you're using any OCF resource agents that didn't make it into the first supported release, though.
>>>>>>>>
>>>>>>>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>>>>>>>
>>>>>>> Thanks, I'll give that a read. All the resource agents are custom, so I'm thinking I'm okay (I'll back them up before upgrading).
>>>>>>>
>>>>>>> One last question related to the fence_ec2 script: should crm_mon -VW show it running on both nodes or just one?
>>>>>>
>>>>>> I just went through the upgrade to pacemaker 1.1.10 and pcs. After running the yum update for those, I ran crm_verify and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?
>>>>>>
>>>>>> # crm_verify -L -V
>>>>>> error: unpack_order_template: Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'
>>>>>
>>>>> Is that true?
>>>>
>>>> No, it's not. The resource exists and the script for the resource exists.
>>>>
>>>> I rolled back to 1.1.8 and the cluster started up without issue.
>>>
>>> Can you send us your config? (cibadmin -Ql)
>>>
>>> Is Varnish in a group or cloned? That might also explain things.
>>
>> The cibadmin output is attached.
>>
>> Yes, the varnish resources are in a group which is then cloned.
>
> -EDONTDOTHAT
>
> You can't refer to the things inside a clone.
> 1.1.8 will have just been ignoring those constraints.
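Concretely, the point above means dropping the constraints that crm_verify flagged (the group already implies ordering and colocation among its members) and expressing any remaining relationship against the clone instead. A hedged sketch, not runnable outside the cluster: the constraint ids are taken from the crm_verify errors, the clone id EIP-AND-VARNISH-clone from the pcs status output, and the exact pcs constraint syntax varies between pcs versions:

```shell
# Remove the constraints that name members of the cloned group
# (ids taken from the crm_verify errors in this thread):
pcs constraint remove order-ClusterEIP_54.215.143.166-Varnish-mandatory
pcs constraint remove order-Varnish-Varnishlog-mandatory
pcs constraint remove order-Varnishlog-Varnishncsa-mandatory
pcs constraint remove colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY
pcs constraint remove colocation-Varnishlog-Varnish-INFINITY
pcs constraint remove colocation-Varnishncsa-Varnishlog-INFINITY

# If a relationship with the varnish stack is still needed, reference
# the clone as a whole rather than the resources inside it:
pcs constraint order ClusterEIP_54.215.143.166 then EIP-AND-VARNISH-clone
pcs constraint colocation add EIP-AND-VARNISH-clone with ClusterEIP_54.215.143.166
```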
So the implicit order and colocation constraints in the group and clone will take care of those? Which means I should remove the constraints and retry the upgrade?

>
>> <cluster-config.out>
>>
>>>>>> error: unpack_order_template: Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
>>>>>> error: unpack_order_template: Invalid constraint 'order-Varnishlog-Varnishncsa-mandatory': No resource or template named 'Varnishlog'
>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY': No resource or template named 'Varnish'
>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishlog-Varnish-INFINITY': No resource or template named 'Varnishlog'
>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishncsa-Varnishlog-INFINITY': No resource or template named 'Varnishncsa'
>>>>>> Errors found during check: config not valid
>>>>>>
>>>>>> The cluster doesn't start. I'd prefer to figure out how to fix this rather than roll back to 1.1.8. Any help is appreciated.
>>>>>>
>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
>>>>>>>>>>>
>>>>>>>>>>> Thanks very much
>>>>>>>>>>>>>
>>>>>>>>>>>>> # pcs status
>>>>>>>>>>>>> Last updated: Thu Nov 7 17:41:21 2013
>>>>>>>>>>>>> Last change: Thu Nov 7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>>>>>>>>>>> Stack: cman
>>>>>>>>>>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>>>>>>>> 11 Resources configured.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>>>>>>>>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Full list of resources:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ClusterEIP_54.215.143.166 (ocf::pacemaker:EIP): Started ip-10-50-3-122
>>>>>>>>>>>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>>>>>>>>>>     Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>>     Stopped: [ EIP-AND-VARNISH:2 ]
>>>>>>>>>>>>> ec2-fencing (stonith:fence_ec2): Stopped
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have no idea where the node that is marked UNCLEAN came from, though it's clearly a typo of a proper cluster node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only command I ran with the bad node ID was:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any possible way that could have caused the node to be added?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node and thus no pcsd, that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.
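On that last question: the Pacemaker versions discussed later in this thread (1.1.10 and newer) reportedly offer a direct way to purge a stopped node from both the configuration and pacemaker's internal caches, which is the cache Andrew mentions below. Whether the exact flags exist in a given build should be checked against crm_node --help; treat this as a sketch:

```shell
# Purge a stopped/ghost node from the CIB and pacemaker's caches
# (assumes crm_node from pacemaker >= 1.1.10; some builds require --force):
crm_node --force --remove ip-10-50-3-1251
```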
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>
>>>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org