On Nov 7, 2013, at 8:59 PM, Sean Lutner <s...@rentul.net> wrote:

> 
> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
>> 
>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>> 
>>> I have a confusing situation that I'm hoping to get help with. Last night, 
>>> after configuring STONITH on my two-node cluster, I suddenly had a "ghost" 
>>> node appear in my cluster. I'm looking to understand the best way to remove 
>>> this node from the config.
>>> 
>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each 
>>> node, registered the device with stonith_admin -R -a fence_ec2, and 
>>> confirmed the registration with both
>>> 
>>> # stonith_admin -I
>>> # pcs stonith list
>>> 
>>> I then configured STONITH per the Clusters from Scratch doc
>>> 
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>> 
>>> Here are my commands:
>>> # pcs cluster cib stonith_cfg
>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 
>>> ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" 
>>> pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" 
>>> timeout="150s" op start start-delay="30s" interval="0"
>>> # pcs -f stonith_cfg stonith
>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>> # pcs -f stonith_cfg property
>>> # pcs cluster push cib stonith_cfg
>>> 
>>> After that I saw that STONITH appears to be functioning, but a new node is 
>>> listed in the pcs status output:
>> 
>> Do the EC2 instances have fixed IPs?
>> I didn't have much luck with EC2 because every time they came back up it was 
>> with a new name/address which confused corosync and created situations like 
>> this.
> 
> The IPs persist across reboots as far as I can tell. I thought the problem 
> was due to stonith being enabled but not working, so I removed the stonith_id 
> and disabled stonith. After that I restarted pacemaker and cman on both nodes 
> and things started as expected, but the ghost node is still there. 
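For reference, the back-out steps I ran were roughly these (typed from memory, 
so the exact syntax may be slightly off):

# pcs stonith delete ec2-fencing
# pcs property set stonith-enabled=false

and then on both nodes:

# service pacemaker stop
# service cman stop
# service cman start
# service pacemaker start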
> 
> Someone else working on the cluster exported the CIB, removed the node and 
> then imported the CIB. They used this process 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
> 
> Even after that, the ghost node is still there. Would pcs cluster cib > 
> /tmp/cib-temp.xml and then pcs cluster push cib /tmp/cib-temp.xml work after 
> editing the node out of the config?
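In other words, something like this (untested; while editing I would delete the 
<node id="ip-10-50-3-1251" .../> entry from the <nodes> section):

# pcs cluster cib > /tmp/cib-temp.xml
# vi /tmp/cib-temp.xml
# pcs cluster push cib /tmp/cib-temp.xml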
> 
> I may have to go back to the drawing board on a fencing device for the nodes. 
> Are there any other recommendations for a cluster on EC2 nodes?
> 
> Thanks very much

Some additional detail from the logs. This is from last night when I added the 
fencing. After I ran the commands above, this was in the logs. I don't see 1251 
in any of the commands I ran, and that node ID doesn't show up in instance or 
tag data in EC2. I'm really confused by this.

Nov  7 03:52:37 ip-10-50-3-122 cibadmin[31146]:   notice: crm_log_args: 
Invoked: /usr/sbin/cibadmin -o resources -C -X <primitive class="stonith" 
id="ec2-fencing" type="fence_ec2"><instance_attributes 
id="ec2-fencing-instance_attributes"><nvpair 
id="ec2-fencing-instance_attributes-ec2-home" name="ec2-home" 
value="/opt/ec2-api-tools"/><nvpair 
id="ec2-fencing-instance_attributes-pcmk_host_check" name="pcmk_host_check" 
value="static-list"/><nvpair 
id="ec2-fencing-instance_attributes-pcmk_host_list" name="pcmk_host_list" 
value="ip-10-50-3-122 ip-10-50-3-251
Nov  7 03:52:41 ip-10-50-3-122 lrmd[18588]:   notice: operation_finished: 
ClusterEIP_54.215.143.166_monitor_30000:31096 [ 2013/11/07_03:52:41 INFO: 
54.215.143.166 is here ]
Nov  7 03:53:14 ip-10-50-3-122 cibadmin[31311]:   notice: crm_log_args: 
Invoked: /usr/sbin/cibadmin -Q --xpath //crm_config 
Nov  7 03:53:14 ip-10-50-3-122 cibadmin[31312]:   notice: crm_log_args: 
Invoked: /usr/sbin/cibadmin -c -R --xml-text <cluster_property_set 
id="cib-bootstrap-options">#012    <nvpair 
id="cib-bootstrap-options-dc-version" name="dc-version" 
value="1.1.8-7.el6-394e906"/>#012    <nvpair 
id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" 
value="cman"/>#012    <nvpair id="cib-bootstrap-options-last-lrm-refresh" 
name="last-lrm-refresh" value="1383790849"/>#012    #012    <nvpair 
id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" v
Nov  7 03:53:19 ip-10-50-3-122 lrmd[18588]:   notice: operation_finished: 
ClusterEIP_54.215.143.166_monitor_30000:31281 [ 2013/11/07_03:53:19 INFO: 
54.215.143.166 is here ]
Nov  7 03:53:28 ip-10-50-3-122 cibadmin[31399]:   notice: crm_log_args: 
Invoked: /usr/sbin/cibadmin -Q --scope crm_config 
Nov  7 03:53:41 ip-10-50-3-122 cibadmin[31430]:   notice: crm_log_args: 
Invoked: /usr/sbin/cibadmin --replace --xml-file stonith_cfg 
Nov  7 03:53:41 ip-10-50-3-122 crmd[18591]:   notice: do_state_transition: 
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: Diff: --- 
1.1181.7
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: Diff: +++ 
1.1184.1 9ecc39408f9be3e1137a0a574fc9df33
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: --         
<nvpair value="false" id="cib-bootstrap-options-stonith-enabled" />
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++         
<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" 
value="true" />
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++       
<primitive class="stonith" id="ec2-fencing" type="fence_ec2" >
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++         
<instance_attributes id="ec2-fencing-instance_attributes" >
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++           
<nvpair id="ec2-fencing-instance_attributes-ec2-home" name="ec2-home" 
value="/opt/ec2-api-tools" />
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++           
<nvpair id="ec2-fencing-instance_attributes-pcmk_host_check" 
name="pcmk_host_check" value="static-list" />
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++           
<nvpair id="ec2-fencing-instance_attributes-pcmk_host_list" 
name="pcmk_host_list" value="ip-10-50-3-122 ip-10-50-3-251" />
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++         
</instance_attributes>
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++         
<operations >
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++           <op 
id="ec2-fencing-interval-0" interval="0" name="monitor" start-delay="30s" 
timeout="150s" />
Nov  7 03:53:41 ip-10-50-3-122 crmd[18591]:   notice: do_state_transition: 
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++         
</operations>
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++       
</primitive>
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: log_cib_diff: cib:diff: 
Local-only Change: 1.1185.1
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: -- <cib 
admin_epoch="1" epoch="1184" num_updates="1" />
Nov  7 03:53:41 ip-10-50-3-122 cib[18586]:   notice: cib:diff: ++       <node 
id="ip-10-50-3-1251" uname="ip-10-50-3-1251" />

> 
>> 
>>> 
>>> # pcs status
>>> Last updated: Thu Nov  7 17:41:21 2013
>>> Last change: Thu Nov  7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>> Stack: cman
>>> Current DC: ip-10-50-3-122 - partition with quorum
>>> Version: 1.1.8-7.el6-394e906
>>> 3 Nodes configured, unknown expected votes
>>> 11 Resources configured.
>>> 
>>> 
>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>> 
>>> Full list of resources:
>>> 
>>> ClusterEIP_54.215.143.166      (ocf::pacemaker:EIP):   Started 
>>> ip-10-50-3-122
>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>    Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>    Stopped: [ EIP-AND-VARNISH:2 ]
>>> ec2-fencing    (stonith:fence_ec2):    Stopped 
>>> 
>>> I have no idea where the node that is marked UNCLEAN came from, though it's 
>>> clearly a typo of a proper cluster node name.
>>> 
>>> The only command I ran with the bad node ID was:
>>> 
>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node 
>>> ip-10-50-3-1251
>>> 
>>> Is there any possible way that could have caused the node to be added?
>>> 
>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is 
>>> no such node (and thus no pcsd to contact) that failed. Is there a way I 
>>> can safely remove this ghost node from the cluster? I can provide logs from 
>>> pacemaker or corosync as needed.


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
