> On Nov 14, 2013, at 6:47 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> On 15 Nov 2013, at 10:24 am, Sean Lutner <s...@rentul.net> wrote:
>>
>>> On Nov 14, 2013, at 6:14 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>
>>>> On 14 Nov 2013, at 2:55 pm, Sean Lutner <s...@rentul.net> wrote:
>>>>
>>>>> On Nov 13, 2013, at 10:51 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>
>>>>>> On 14 Nov 2013, at 1:12 pm, Sean Lutner <s...@rentul.net> wrote:
>>>>>>
>>>>>>> On Nov 10, 2013, at 8:03 PM, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>
>>>>>>>> On Nov 10, 2013, at 7:54 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>
>>>>>>>>> On 11 Nov 2013, at 11:44 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>
>>>>>>>>>> On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 8 Nov 2013, at 12:59 pm, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a confusing situation that I'm hoping to get help with. Last night, after configuring STONITH on my two-node cluster, I suddenly have a "ghost" node in my cluster. I'm looking to understand the best way to remove this node from the config.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm using the fence_ec2 device for STONITH.
>>>>>>>>>>>>> I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2, and confirmed the registration with both:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # stonith_admin -I
>>>>>>>>>>>>> # pcs stonith list
>>>>>>>>>>>>>
>>>>>>>>>>>>> I then configured STONITH per the Clusters from Scratch doc:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are my commands:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # pcs cluster cib stonith_cfg
>>>>>>>>>>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>>>>>>>>>>>> # pcs -f stonith_cfg stonith
>>>>>>>>>>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>>>>>>>>>>> # pcs -f stonith_cfg property
>>>>>>>>>>>>> # pcs cluster push cib stonith_cfg
>>>>>>>>>>>>>
>>>>>>>>>>>>> After that I saw that STONITH appears to be functioning, but a new node is listed in pcs status output:
>>>>>>>>>>>>
>>>>>>>>>>>> Do the EC2 instances have fixed IPs?
>>>>>>>>>>>> I didn't have much luck with EC2 because every time the instances came back up it was with a new name/address, which confused corosync and created situations like this.
>>>>>>>>>>>
>>>>>>>>>>> The IPs persist across reboots as far as I can tell. I thought the problem was due to stonith being enabled but not working, so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
>>>>>>>>>>>
>>>>>>>>>>> Someone else working on the cluster exported the CIB, removed the node, and then imported the CIB.
>>>>>>>>>>> They used this process:
>>>>>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>>>>>>>>>>>
>>>>>>>>>>> Even after that, the ghost node is still there. Would pcs cluster cib > /tmp/cib-temp.xml, followed by pcs cluster push cib /tmp/cib-temp.xml after editing the node out of the config, work?
>>>>>>>>>>
>>>>>>>>>> No. If it's coming back, then pacemaker is holding it in one of its internal caches.
>>>>>>>>>> The only way to clear it out in your version is to restart pacemaker on the DC.
>>>>>>>>>>
>>>>>>>>>> Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :)
>>>>>>>>>
>>>>>>>>> In the end this fixed it:
>>>>>>>>>
>>>>>>>>> # pcs cluster cib > /tmp/cib-tmp.xml
>>>>>>>>> # vi /tmp/cib-tmp.xml   # remove bad node
>>>>>>>>> # pcs cluster push cib /tmp/cib-tmp.xml
>>>>>>>>>
>>>>>>>>> followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned.
>>>>>>>>>
>>>>>>>>> I also tracked the bad IP down to non-printing characters in the initial command line while configuring the fence_ec2 stonith device. I'd put the command together from the github README and some mailing list posts and laid it out in an external editor. Go me. :)
>>>>>>>>>
>>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>
>>>>>>>>>> There is now an update to 1.1.10 available for 6.4, that _may_ help in the future.
>>>>>>>>>
>>>>>>>>> That's my next task. I believe I'm hitting the failure-timeout-not-clearing-failcount bug and want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4; should I update that as well?
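For reference, the manual vi step in the fix above can also be scripted. A minimal sketch of the same edit; the tiny <nodes> section below is a hypothetical stand-in for the exported CIB (the real one would come from pcs cluster cib), and it assumes the ghost entry fits on one <node .../> line, as cibadmin typically emits it:

```shell
# Hypothetical stand-in for `pcs cluster cib > /tmp/cib-tmp.xml`:
cat > /tmp/cib-tmp.xml <<'EOF'
<nodes>
  <node id="ip-10-50-3-122" uname="ip-10-50-3-122"/>
  <node id="ip-10-50-3-251" uname="ip-10-50-3-251"/>
  <node id="ip-10-50-3-1251" uname="ip-10-50-3-1251"/>
</nodes>
EOF

# The vi edit boils down to deleting the ghost node's line:
sed -i '/uname="ip-10-50-3-1251"/d' /tmp/cib-tmp.xml

# Two real nodes remain; on a live cluster the result would then be
# pushed back with `pcs cluster push cib /tmp/cib-tmp.xml`.
grep -c '<node ' /tmp/cib-tmp.xml   # prints 2
```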
>>>>>>>> yes and yes
>>>>>>>>
>>>>>>>> You might want to check whether you're using any OCF resource agents that didn't make it into the first supported release, though.
>>>>>>>>
>>>>>>>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>>>>>>>
>>>>>>> Thanks, I'll give that a read. All the resource agents are custom, so I'm thinking I'm okay (I'll back them up before upgrading).
>>>>>>>
>>>>>>> One last question related to the fence_ec2 script: should crm_mon -VW show it running on both nodes or just one?
>>>>>>
>>>>>> I just went through the upgrade to pacemaker 1.1.10 and pcs. After running the yum update for those, I ran crm_verify and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?
>>>>>>
>>>>>> # crm_verify -L -V
>>>>>> error: unpack_order_template: Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'
>>>>>
>>>>> Is that true?
>>>>
>>>> No, it's not. The resource exists and the script for the resource exists.
>>>>
>>>> I rolled back to 1.1.8 and the cluster started up without issue.
>>>
>>> Can you send us your config? (cibadmin -Ql)
>>>
>>> Is Varnish in a group or cloned? That might also explain things.
>>
>> The cibadmin output is attached.
>>
>> Yes, the varnish resources are in a group which is then cloned.
>
> -EDONTDOTHAT
>
> You can't refer to the things inside a clone.
> 1.1.8 will have just been ignoring those constraints.
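Concretely, the point above means dropping the constraints that crm_verify flagged (the group already implies ordering and colocation among its members) and expressing any remaining relationship against the clone instead. A hedged sketch, not runnable outside the cluster: the constraint ids are taken from the crm_verify errors, the clone id EIP-AND-VARNISH-clone from the pcs status output, and the exact pcs constraint syntax varies between pcs versions:

```shell
# Remove the constraints that name members of the cloned group
# (ids taken from the crm_verify errors in this thread):
pcs constraint remove order-ClusterEIP_54.215.143.166-Varnish-mandatory
pcs constraint remove order-Varnish-Varnishlog-mandatory
pcs constraint remove order-Varnishlog-Varnishncsa-mandatory
pcs constraint remove colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY
pcs constraint remove colocation-Varnishlog-Varnish-INFINITY
pcs constraint remove colocation-Varnishncsa-Varnishlog-INFINITY

# If a relationship with the varnish stack is still needed, reference
# the clone as a whole rather than the resources inside it:
pcs constraint order ClusterEIP_54.215.143.166 then EIP-AND-VARNISH-clone
pcs constraint colocation add EIP-AND-VARNISH-clone with ClusterEIP_54.215.143.166
```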
So the implicit order and colocation constraints in the group and clone will take care of those? Which means I should remove the constraints and retry the upgrade?

>
>> <cluster-config.out>
>>
>>>>>> error: unpack_order_template: Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
>>>>>> error: unpack_order_template: Invalid constraint 'order-Varnishlog-Varnishncsa-mandatory': No resource or template named 'Varnishlog'
>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY': No resource or template named 'Varnish'
>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishlog-Varnish-INFINITY': No resource or template named 'Varnishlog'
>>>>>> error: unpack_colocation_template: Invalid constraint 'colocation-Varnishncsa-Varnishlog-INFINITY': No resource or template named 'Varnishncsa'
>>>>>> Errors found during check: config not valid
>>>>>>
>>>>>> The cluster doesn't start. I'd prefer to figure out how to fix this rather than roll back to 1.1.8. Any help is appreciated.
>>>>>>
>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
>>>>>>>>>>>
>>>>>>>>>>> Thanks very much
>>>>>>>>>>>>>
>>>>>>>>>>>>> # pcs status
>>>>>>>>>>>>> Last updated: Thu Nov 7 17:41:21 2013
>>>>>>>>>>>>> Last change: Thu Nov 7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>>>>>>>>>>> Stack: cman
>>>>>>>>>>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>>>>>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>>>>>>>> 11 Resources configured.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>>>>>>>>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Full list of resources:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ClusterEIP_54.215.143.166 (ocf::pacemaker:EIP): Started ip-10-50-3-122
>>>>>>>>>>>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>>>>>>>>>>     Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>>>>>>>     Stopped: [ EIP-AND-VARNISH:2 ]
>>>>>>>>>>>>> ec2-fencing (stonith:fence_ec2): Stopped
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have no idea where the node that is marked UNCLEAN came from, though it's clearly a typo of a proper cluster node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only command I ran with the bad node ID was:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any possible way that could have caused the node to be added?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node and thus no pcsd, that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.
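On that last question: the Pacemaker versions discussed later in this thread (1.1.10 and newer) reportedly offer a direct way to purge a stopped node from both the configuration and pacemaker's internal caches, which is the cache Andrew mentions below. Whether the exact flags exist in a given build should be checked against crm_node --help; treat this as a sketch:

```shell
# Purge a stopped/ghost node from the CIB and pacemaker's caches
# (assumes crm_node from pacemaker >= 1.1.10; some builds require --force):
crm_node --force --remove ip-10-50-3-1251
```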
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>
>>>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org