On Nov 7, 2013, at 8:59 PM, Sean Lutner <s...@rentul.net> wrote:
>
> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>
>>
>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>
>>> I have a confusing situation that I'm hoping to get help with. Last night,
>>> after configuring STONITH on my two-node cluster, I suddenly have a "ghost"
>>> node in my cluster. I'm looking to understand the best way to remove this
>>> node from the config.
>>>
>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each
>>> node, registered the device with stonith_admin -R -a fence_ec2, and
>>> confirmed the registration with both:
>>>
>>> # stonith_admin -I
>>> # pcs stonith list
>>>
>>> I then configured STONITH per the Clusters from Scratch doc:
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>
>>> Here are my commands:
>>>
>>> # pcs cluster cib stonith_cfg
>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>> # pcs -f stonith_cfg stonith
>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>> # pcs -f stonith_cfg property
>>> # pcs cluster push cib stonith_cfg
>>>
>>> After that I saw that STONITH appears to be functioning, but a new node is
>>> listed in the pcs status output:
>>
>> Do the EC2 instances have fixed IPs?
>> I didn't have much luck with EC2 because every time they came back up it
>> was with a new name/address, which confused corosync and created situations
>> like this.
>
> The IPs persist across reboots as far as I can tell. I thought the problem
> was due to STONITH being enabled but not working, so I removed the stonith
> resource and disabled STONITH. After that I restarted pacemaker and cman on
> both nodes and things started as expected, but the ghost node is still there.
>
> Someone else working on the cluster exported the CIB, removed the node, and
> then imported the CIB. They used this process:
> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>
> Even after that, the ghost node is still there. Would running
> pcs cluster cib > /tmp/cib-temp.xml, editing the node out of the config, and
> then pcs cluster push cib /tmp/cib-temp.xml work?
>
> I may have to go back to the drawing board on a fencing device for the
> nodes. Are there any other recommendations for a cluster on EC2 nodes?
>
> Thanks very much
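To make the workflow in that last question concrete, this is roughly what I
have in mind (untested on this cluster, and assuming the ghost entry only
lives in the <nodes> section of the configuration):

# pcs cluster cib /tmp/cib-temp.xml
  ... edit /tmp/cib-temp.xml and delete the line
      <node id="ip-10-50-3-1251" uname="ip-10-50-3-1251"/>
  from the <nodes> section ...
# pcs cluster push cib /tmp/cib-temp.xml

I'm not sure whether the <status> section would simply re-add the node after
the push, which is partly why I'm asking before trying it.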
Some additional detail from the logs. This is from last night when I added the
fencing. After I ran the commands above, this was in the logs. I don't see 1251
in any commands I ran, and that node ID doesn't show up in instance or tag data
in EC2. I'm really confused by this.

Nov 7 03:52:37 ip-10-50-3-122 cibadmin[31146]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o resources -C -X <primitive class="stonith" id="ec2-fencing" type="fence_ec2"><instance_attributes id="ec2-fencing-instance_attributes"><nvpair id="ec2-fencing-instance_attributes-ec2-home" name="ec2-home" value="/opt/ec2-api-tools"/><nvpair id="ec2-fencing-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/><nvpair id="ec2-fencing-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="ip-10-50-3-122 ip-10-50-3-251
Nov 7 03:52:41 ip-10-50-3-122 lrmd[18588]: notice: operation_finished: ClusterEIP_54.215.143.166_monitor_30000:31096 [ 2013/11/07_03:52:41 INFO: 54.215.143.166 is here ]
Nov 7 03:53:14 ip-10-50-3-122 cibadmin[31311]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -Q --xpath //crm_config
Nov 7 03:53:14 ip-10-50-3-122 cibadmin[31312]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -c -R --xml-text <cluster_property_set id="cib-bootstrap-options">#012 <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>#012 <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>#012 <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1383790849"/>#012 #012 <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" v
Nov 7 03:53:19 ip-10-50-3-122 lrmd[18588]: notice: operation_finished: ClusterEIP_54.215.143.166_monitor_30000:31281 [ 2013/11/07_03:53:19 INFO: 54.215.143.166 is here ]
Nov 7 03:53:28 ip-10-50-3-122 cibadmin[31399]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -Q --scope crm_config
Nov 7 03:53:41 ip-10-50-3-122 cibadmin[31430]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin --replace --xml-file stonith_cfg
Nov 7 03:53:41 ip-10-50-3-122 crmd[18591]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: Diff: --- 1.1181.7
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: Diff: +++ 1.1184.1 9ecc39408f9be3e1137a0a574fc9df33
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: -- <nvpair value="false" id="cib-bootstrap-options-stonith-enabled" />
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true" />
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <primitive class="stonith" id="ec2-fencing" type="fence_ec2" >
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <instance_attributes id="ec2-fencing-instance_attributes" >
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <nvpair id="ec2-fencing-instance_attributes-ec2-home" name="ec2-home" value="/opt/ec2-api-tools" />
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <nvpair id="ec2-fencing-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list" />
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <nvpair id="ec2-fencing-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="ip-10-50-3-122 ip-10-50-3-251" />
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ </instance_attributes>
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <operations >
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <op id="ec2-fencing-interval-0" interval="0" name="monitor" start-delay="30s" timeout="150s" />
Nov 7 03:53:41 ip-10-50-3-122 crmd[18591]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ </operations>
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ </primitive>
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: log_cib_diff: cib:diff: Local-only Change: 1.1185.1
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: -- <cib admin_epoch="1" epoch="1184" num_updates="1" />
Nov 7 03:53:41 ip-10-50-3-122 cib[18586]: notice: cib:diff: ++ <node id="ip-10-50-3-1251" uname="ip-10-50-3-1251" />

>
>>
>>>
>>> # pcs status
>>> Last updated: Thu Nov 7 17:41:21 2013
>>> Last change: Thu Nov 7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>> Stack: cman
>>> Current DC: ip-10-50-3-122 - partition with quorum
>>> Version: 1.1.8-7.el6-394e906
>>> 3 Nodes configured, unknown expected votes
>>> 11 Resources configured.
>>>
>>>
>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>
>>> Full list of resources:
>>>
>>> ClusterEIP_54.215.143.166  (ocf::pacemaker:EIP):  Started ip-10-50-3-122
>>> Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>     Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>     Stopped: [ EIP-AND-VARNISH:2 ]
>>> ec2-fencing  (stonith:fence_ec2):  Stopped
>>>
>>> I have no idea where the node that is marked UNCLEAN came from, though it
>>> is clearly a typo of a proper cluster node's name.
>>>
>>> The only command I ran with the bad node ID was:
>>>
>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>
>>> Is there any possible way that could have caused the node to be added?
>>>
>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there
>>> is no such node, and thus no pcsd on it, that failed. Is there a way I can
>>> safely remove this ghost node from the cluster? I can provide logs from
>>> pacemaker or corosync as needed.
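If the export/edit/push route isn't the right way to do this, the more direct
CIB surgery I'm considering is adapted from the node-removal steps in
Pacemaker Explained (untested here, so please say so if this is unsafe on a
cman-based cluster). First, check whether the ghost entry exists in the
configuration and status sections:

# cibadmin -Q --scope nodes | grep ip-10-50-3-1251
# cibadmin -Q --scope status | grep ip-10-50-3-1251

and, if it does, delete the stale entries:

# cibadmin --delete -o nodes -X '<node id="ip-10-50-3-1251" uname="ip-10-50-3-1251"/>'
# cibadmin --delete -o status -X '<node_state uname="ip-10-50-3-1251"/>'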
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org