Oops, I omitted the cluster config:

root@Vindemiatrix:/home/david# crm configure show | cat
node Malastare \
        attributes standby="off"
node Vindemiatrix \
        attributes standby="off"
primitive OVHvIP ocf:pacemaker:OVHvIP
primitive ProFTPd ocf:heartbeat:proftpd \
        params conffile="/etc/proftpd/proftpd.conf" \
        op monitor interval="60s"
primitive VirtualIP ocf:heartbeat:IPaddr2 \
        params ip="178.33.109.180" nic="eth0" cidr_netmask="32"
primitive drbd_backupvi ocf:linbit:drbd \
        params drbd_resource="backupvi" \
        op monitor interval="15s"
primitive drbd_pgsql ocf:linbit:drbd \
        params drbd_resource="postgresql" \
        op monitor interval="15s"
primitive drbd_svn ocf:linbit:drbd \
        params drbd_resource="svn" \
        op monitor interval="15s"
primitive drbd_www ocf:linbit:drbd \
        params drbd_resource="www" \
        op monitor interval="15s"
primitive fs_backupvi ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/backupvi" directory="/var/backupvi" fstype="ext3"
primitive fs_pgsql ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/postgresql" directory="/var/lib/postgresql" fstype="ext3" \
        meta target-role="Started"
primitive fs_svn ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/svn" directory="/var/lib/svn" fstype="ext3" \
        meta target-role="Started"
primitive fs_www ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/www" directory="/var/www" fstype="ext3"
primitive soapi-fencing-malastare stonith:external/ovh \
        params reversedns="ns208812.ovh.net"
primitive soapi-fencing-vindemiatrix stonith:external/ovh \
        params reversedns="ns235795.ovh.net"
ms ms_drbd_backupvi drbd_backupvi \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_pgsql drbd_pgsql \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
ms ms_drbd_svn drbd_svn \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
ms ms_drbd_www drbd_www \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location stonith-malastare soapi-fencing-malastare -inf: Malastare
location stonith-vindemiatrix soapi-fencing-vindemiatrix -inf: Vindemiatrix
colocation FS_on_same_host inf: ms_drbd_backupvi:Master ms_drbd_svn:Master ms_drbd_www:Master ms_drbd_pgsql:Master
colocation IPAddr2_with_OVHvIP inf: OVHvIP VirtualIP
colocation IP_and_www inf: OVHvIP ms_drbd_www
colocation ProFTPd_www inf: ProFTPd fs_www
colocation backupvi inf: fs_backupvi ms_drbd_backupvi:Master
colocation pgsql_coloc inf: fs_pgsql ms_drbd_pgsql:Master
colocation svn_coloc inf: fs_svn ms_drbd_svn:Master
colocation www_coloc inf: fs_www ms_drbd_www:Master
order IPAddr2_OVHvIP inf: OVHvIP:start VirtualIP:start
order backupvi_order inf: ms_drbd_backupvi:promote fs_backupvi:start
order pgsql_order inf: ms_drbd_pgsql:promote fs_pgsql:start
order svn_order inf: ms_drbd_svn:promote fs_svn:start
order www_order inf: ms_drbd_www:promote fs_www:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        default-resource-stickiness="50" \
        no-quorum-policy="ignore"
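In case the allocation scores are useful for diagnosing the stickiness theory quoted below, they can be dumped from the live CIB with the scoring tools shipped with Pacemaker 1.1 (I have not included the output here):

# Show the allocation scores Pacemaker computes for the live CIB;
# stickiness and colocation contributions appear per resource and node.
crm_simulate --live-check --show-scores
# or, with the older tool:
ptest -L -s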
Thank you in advance for your help.

Kind regards.

On 05/07/2012 16:12, David Guyot wrote:
> Hello, everybody.
>
> As the title suggests, I'm configuring a 2-node cluster, but I've got a
> strange issue here: when I put a node in standby mode with "crm node
> standby", its resources are correctly moved to the second node and stay
> there even after the first node comes back online, which I assume is the
> preferred behaviour (preferred by the designers of such systems) to avoid
> running resources on a potentially unstable node. Nevertheless, when I
> simulate a failure of the node running the resources with
> "/etc/init.d/corosync stop", the other node correctly fences the failed
> node by electrically resetting it, but it does not mount the resources on
> itself; instead, it waits for the failed node to come back online and then
> re-negotiates resource placement, which inevitably ends with the failed
> node restarting the resources. I suppose this is a consequence of the
> resource stickiness still recorded by the intact node: because that node
> still assumes the resources are running on the failed node, it concludes
> that they prefer to stay there, even though the node has failed.
>
> When the first node, Vindemiatrix, has shut down Corosync, the second,
> Malastare, reports this:
>
> root@Malastare:/home/david# crm_mon --one-shot -VrA
> ============
> Last updated: Thu Jul  5 15:27:01 2012
> Last change: Thu Jul  5 15:26:37 2012 via cibadmin on Malastare
> Stack: openais
> Current DC: Malastare - partition WITHOUT quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 17 Resources configured.
> ============
>
> Node Vindemiatrix: UNCLEAN (offline)
> Online: [ Malastare ]
>
> Full list of resources:
>
> soapi-fencing-malastare (stonith:external/ovh): Started Vindemiatrix
> soapi-fencing-vindemiatrix (stonith:external/ovh): Started Malastare
> Master/Slave Set: ms_drbd_svn [drbd_svn]
>     Masters: [ Vindemiatrix ]
>     Slaves: [ Malastare ]
> Master/Slave Set: ms_drbd_pgsql [drbd_pgsql]
>     Masters: [ Vindemiatrix ]
>     Slaves: [ Malastare ]
> Master/Slave Set: ms_drbd_backupvi [drbd_backupvi]
>     Masters: [ Vindemiatrix ]
>     Slaves: [ Malastare ]
> Master/Slave Set: ms_drbd_www [drbd_www]
>     Masters: [ Vindemiatrix ]
>     Slaves: [ Malastare ]
> fs_www (ocf::heartbeat:Filesystem): Started Vindemiatrix
> fs_pgsql (ocf::heartbeat:Filesystem): Started Vindemiatrix
> fs_svn (ocf::heartbeat:Filesystem): Started Vindemiatrix
> fs_backupvi (ocf::heartbeat:Filesystem): Started Vindemiatrix
> VirtualIP (ocf::heartbeat:IPaddr2): Started Vindemiatrix
> OVHvIP (ocf::pacemaker:OVHvIP): Started Vindemiatrix
> ProFTPd (ocf::heartbeat:proftpd): Started Vindemiatrix
>
> Node Attributes:
> * Node Malastare:
>     + master-drbd_backupvi:0 : 10000
>     + master-drbd_pgsql:0 : 10000
>     + master-drbd_svn:0 : 10000
>     + master-drbd_www:0 : 10000
>
> As you can see, the node failure is detected. This state leads to the
> attached log file.
>
> Note that both ocf::pacemaker:OVHvIP and stonith:external/ovh are custom
> resources which use my server provider's SOAP API to provide the intended
> services. The STONITH agent does nothing but return exit status 0 for the
> start, stop, on and off actions; it returns the two node names for the
> hostlist and gethosts actions and, for the reset action, it effectively
> resets the faulting node through the provider's API. As this API does not
> provide a reliable means of knowing the exact moment of the reset, the
> STONITH agent pings the faulting node every 5 seconds until the ping
> fails, then forks a process which pings the faulting node every 5 seconds
> until it answers again. Then, because the dedicated VPN has not yet been
> installed by the provider, I am forced to emulate it with OpenVPN (which
> seems unable to re-establish a connection lost minutes earlier, leading to
> a split-brain situation), so the STONITH agent restarts OpenVPN to
> re-establish the connection, and finally restarts Corosync and Pacemaker.
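For reference, the agent follows the usual layout of a cluster-glue external STONITH plugin; a simplified sketch (the SOAP helper below is only a placeholder name, not the real OVH call) looks roughly like this:

#!/bin/sh
# Simplified external STONITH plugin sketch (cluster-glue interface).
# $1 is the requested action, $2 the target host for reset/on/off;
# parameters such as "reversedns" arrive as environment variables.
case "$1" in
    gethosts|hostlist)
        echo "Malastare Vindemiatrix"
        ;;
    reset)
        # Placeholder: the real agent calls the OVH SOAP API here,
        # then waits until "$2" stops answering pings before returning.
        ovh_soap_reboot "$2" || exit 1
        ;;
    on|off|status)
        exit 0
        ;;
    getconfignames)
        echo "reversedns"
        ;;
    getinfo-devid|getinfo-devname)
        echo "OVH SOAP STONITH device"
        ;;
    getinfo-devdescr|getinfo-devurl|getinfo-xml)
        # getinfo-xml is normally expected to return an XML parameter description.
        echo "Fences a node through the OVH SOAP API"
        ;;
    *)
        exit 1
        ;;
esac
exit 0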
> Aside from the VPN issue, of which I am fully aware of the performance and
> stability implications, I thought that Pacemaker would start the resources
> on the remaining node as soon as the STONITH agent returned exit status 0,
> but it doesn't. Instead, it seems that the STONITH reset action takes too
> long to report a successful reset; that delay exceeds some internal
> timeout, which in turn leads Pacemaker to assume that the STONITH agent
> failed. It therefore keeps trying to reset the node (which only makes the
> API return an error, because a new reset request less than 5 minutes after
> the previous one is forbidden) while stopping actions without restarting
> the resources on the remaining node. I searched the Internet for this
> parameter, but the only related thing I found is this page,
> http://lists.linux-ha.org/pipermail/linux-ha/2010-March/039761.html, a
> Linux-HA mailing list archive, which mentions a stonith-timeout property.
> I parsed the Pacemaker documentation without finding any occurrence of it,
> and I get an error when I try to query its value:
>
> root@Vindemiatrix:/home/david# crm_attribute --name stonith-timeout --query
> scope=crm_config name=stonith-timeout value=(null)
> Error performing operation: The object/attribute does not exist
>
> So what did I miss? Must I use this property, even though it appears
> nowhere in the Pacemaker documentation? Or should I rewrite my STONITH
> agent to return exit status 0 as soon as the API has accepted the reset
> request (contrary to what Linux-HA, http://linux-ha.org/wiki/STONITH,
> says is necessary)? Or is there something else I missed?
>
> Thank you now for having read this whole mail, and in advance for your help.
>
> Kind regards.
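PS: if stonith-timeout is indeed the right knob, I assume it can be set like any other cluster property (the 120s value below is only a guess on my part), after which it should also show up in "crm configure show" and in the crm_attribute query above:

# Hypothetical value; I have not found a documented recommendation.
crm configure property stonith-timeout="120s"
# Once set explicitly, the property is no longer reported as (null):
crm_attribute --type crm_config --name stonith-timeout --query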