Any suggestions on what the problem might be? I'm trying to set up a 2-node HA iSCSI cluster, but I am having problems with failover. I want each node to be active, each serving half of the storage through its own IP/target/LUN. I have it working until I set up an initiator and test failover: if an initiator (Proxmox) is using the target/LUN and I then migrate the LUN/target/DRBD to the other node, I get inconsistent results.
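For context, the initiator side is plain open-iscsi on the Proxmox host. The login amounts to something like the following, with the IP and IQN taken from the data01 resources in the configuration below; Proxmox drives this through its own storage setup, so the exact invocation may differ:

# iscsiadm -m discovery -t sendtargets -p 10.11.2.13
# iscsiadm -m node -T iqn.2013-10.com.xdomain.storage1001.data01 -p 10.11.2.13 --login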
Sometimes it works fine, but it usually fails. It appears to fail while stopping the target, after the LUN has been stopped, and at that point the node is fenced (rebooted). After fencing, sometimes the resources migrate to the surviving node successfully, sometimes they wait until the fenced node comes back online and then move, and sometimes they just remain in the failed state. Sometimes, after the fenced node boots back up, it tries to move the resources back to itself, that fails too, and the two nodes keep fencing each other until there is some manual intervention. Any assistance in getting this set up correctly would be appreciated. Relevant details are below.

Setup would look like:

On Storage1001A: LUN:vol01 - Target:data01 - IP1/IP11 - LV:lv_vol01 - VG:vg_data01 - DRBD:data01
On Storage1001B: LUN:vol01 - Target:data02 - IP2/IP22 - LV:lv_vol01 - VG:vg_data02 - DRBD:data02

Versions:
Linux Storage1001A.xdomain.com 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux
targetcli 2.0rc1-2
lio-utils 3.1+git2.fd0b34fd-2
drbd8-utils 2:8.3.13-2
corosync 1.4.2-3
pacemaker 1.1.7-1

Network is set up with 3 bonded pairs:
- Bond0: Client interface
- Bond1: Crossover
- Bond2: Client interface (for future multicast)

The log file shows:
- Working fine
- Migrate from Storage1001A to Storage1001B
- Storage1001A hangs/crashes after stopping the LUN, before stopping the Target
- Storage1001A is fenced
- Resources migrate to Storage1001B (not 100% sure on this)
- Storage1001A boots back up and tries to take back the resources
- Storage1001B hangs after stopping the LUN, before stopping the Target
- Manually cycle power on Storage1001B
- Resources remain "stuck" on Storage1001B until Storage1001B is back online (stonith is currently disabled for Storage1001B, but sometimes it remains "stuck" when enabled)
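For reference, the manual intervention I end up doing is roughly the following from the crm shell: clean up the failed resource, and start node B's stonith resource again when I want fencing re-enabled there.

# crm resource cleanup p_target_data01
# crm resource start st_node_b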
************************************************************
*** Normal Status
************************************************************

# crm status
============
Last updated: Sat Dec 14 14:58:04 2013
Last change: Sat Dec 14 14:45:43 2013 via crm_resource on Storage1001B.xdomain.com
Stack: openais
Current DC: Storage1001A.xdomain.com - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
22 Resources configured.
============

Online: [ Storage1001B.xdomain.com Storage1001A.xdomain.com ]

 st_node_a   (stonith:external/synaccess):   Started Storage1001B.xdomain.com
 Resource Group: rg_data01
     p_lvm01             (ocf::heartbeat:LVM):                Started Storage1001A.xdomain.com
     p_ip1               (ocf::heartbeat:IPaddr):             Started Storage1001A.xdomain.com
     p_ip11              (ocf::heartbeat:IPaddr):             Started Storage1001A.xdomain.com
     p_target_data01     (ocf::heartbeat:iSCSITarget):        Started Storage1001A.xdomain.com
     p_lu_data01_vol01   (ocf::heartbeat:iSCSILogicalUnit):   Started Storage1001A.xdomain.com
     p_email_admin1      (ocf::heartbeat:MailTo):             Started Storage1001A.xdomain.com
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ Storage1001A.xdomain.com ]
     Slaves: [ Storage1001B.xdomain.com ]
 Master/Slave Set: ms_drbd2 [p_drbd2]
     Masters: [ Storage1001B.xdomain.com ]
     Slaves: [ Storage1001A.xdomain.com ]
 Clone Set: c_lsb_target [p_target]
     Started: [ Storage1001A.xdomain.com Storage1001B.xdomain.com ]
 Clone Set: c_ping [p_ping]
     Started: [ Storage1001B.xdomain.com Storage1001A.xdomain.com ]

************************************************************
*** Status after migrate
************************************************************

# crm resource migrate rg_data01 Storage1001B.xdomain.com
# crm status
============
Last updated: Sat Dec 14 16:40:55 2013
Last change: Sat Dec 14 16:40:48 2013 via crm_resource on Storage1001B.xdomain.com
Stack: openais
Current DC: Storage1001A.xdomain.com - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
22 Resources configured.
============

Online: [ Storage1001B.xdomain.com Storage1001A.xdomain.com ]

 st_node_a   (stonith:external/synaccess):   Started Storage1001B.xdomain.com
 Resource Group: rg_data01
     p_lvm01             (ocf::heartbeat:LVM):                Started Storage1001A.xdomain.com
     p_ip1               (ocf::heartbeat:IPaddr):             Started Storage1001A.xdomain.com
     p_ip11              (ocf::heartbeat:IPaddr):             Started Storage1001A.xdomain.com
     p_target_data01     (ocf::heartbeat:iSCSITarget):        Started Storage1001A.xdomain.com
     p_lu_data01_vol01   (ocf::heartbeat:iSCSILogicalUnit):   Stopped
     p_email_admin1      (ocf::heartbeat:MailTo):             Stopped
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ Storage1001A.xdomain.com ]
     Slaves: [ Storage1001B.xdomain.com ]
 Master/Slave Set: ms_drbd2 [p_drbd2]
     Masters: [ Storage1001B.xdomain.com ]
     Slaves: [ Storage1001A.xdomain.com ]
 Clone Set: c_lsb_target [p_target]
     Started: [ Storage1001A.xdomain.com Storage1001B.xdomain.com ]
 Clone Set: c_ping [p_ping]
     Started: [ Storage1001B.xdomain.com Storage1001A.xdomain.com ]
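One thing I am not sure matters: as I understand it, the migrate above leaves a cli- location constraint in the CIB until it is removed, which I have been doing between tests with:

# crm resource unmigrate rg_data01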
************************************************************
*** Status after first node boots back up and tries to take back resource
************************************************************

# crm status
============
Last updated: Tue Dec 17 14:06:15 2013
Last change: Tue Dec 17 14:05:44 2013 via cibadmin on Storage1001A.xdomain.com
Stack: openais
Current DC: Storage1001B.xdomain.com - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
22 Resources configured.
============

Node Storage1001B.xdomain.com: UNCLEAN (online)
Online: [ Storage1001A.xdomain.com ]

 st_node_a   (stonith:external/synaccess):   Started Storage1001B.xdomain.com
 Resource Group: rg_data01
     p_lvm01             (ocf::heartbeat:LVM):                Started Storage1001B.xdomain.com
     p_ip1               (ocf::heartbeat:IPaddr):             Started Storage1001B.xdomain.com
     p_ip11              (ocf::heartbeat:IPaddr):             Started Storage1001B.xdomain.com
     p_target_data01     (ocf::heartbeat:iSCSITarget):        Started Storage1001B.xdomain.com FAILED
     p_lu_data01_vol01   (ocf::heartbeat:iSCSILogicalUnit):   Stopped
     p_email_admin1      (ocf::heartbeat:MailTo):             Stopped
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ Storage1001B.xdomain.com ]
     Slaves: [ Storage1001A.xdomain.com ]
 Master/Slave Set: ms_drbd2 [p_drbd2]
     Masters: [ Storage1001B.xdomain.com ]
     Slaves: [ Storage1001A.xdomain.com ]
 Clone Set: c_lsb_target [p_target]
     Started: [ Storage1001A.xdomain.com Storage1001B.xdomain.com ]
 Clone Set: c_ping [p_ping]
     Started: [ Storage1001A.xdomain.com Storage1001B.xdomain.com ]

Failed actions:
    p_target_data01_stop_0 (node=Storage1001B.xdomain.com, call=111, rc=-2, status=Timed Out): unknown exec error
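Since it is the stop of p_target_data01 that times out, one thing I am considering (not applied yet; the timeout values here are just guesses) is giving the target and LUN primitives from the configuration below explicit, longer start/stop timeouts, along these lines:

primitive p_target_data01 ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2013-10.com.xdomain.storage1001.data01" implementation="lio" \
        op monitor interval="10s" timeout="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive p_lu_data01_vol01 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2013-10.com.xdomain.storage1001.data01" lun="1" path="/dev/vg_data01/lv_vol01" implementation="lio" \
        op monitor interval="10" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"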
************************************************************
*** Configuration
*** Note: Some resources are stopped to try and get one
***       resource group working properly
************************************************************

node Storage1001A.xdomain.com
node Storage1001B.xdomain.com
primitive p_drbd1 ocf:linbit:drbd \
        params drbd_resource="Data01" \
        op monitor interval="3" role="Master" timeout="9" \
        op monitor interval="4" role="Slave" timeout="12"
primitive p_drbd2 ocf:linbit:drbd \
        params drbd_resource="Data02" \
        op monitor interval="3" role="Master" timeout="9" \
        op monitor interval="4" role="Slave" timeout="12"
primitive p_email_admin1 ocf:heartbeat:MailTo \
        params email="ad...@xdomain.com" subject="Cluster Failover"
primitive p_email_admin2 ocf:heartbeat:MailTo \
        params email="ad...@xdomain.com" subject="Cluster Failover"
primitive p_ip1 ocf:heartbeat:IPaddr \
        params ip="10.11.2.13" nic="bond0" cidr_netmask="21" \
        op monitor interval="5s"
primitive p_ip11 ocf:heartbeat:IPaddr \
        params ip="10.11.10.13" nic="bond2" cidr_netmask="21" \
        op monitor interval="5s"
primitive p_ip2 ocf:heartbeat:IPaddr \
        params ip="10.11.2.14" nic="bond0" cidr_netmask="21" \
        op monitor interval="5s"
primitive p_ip22 ocf:heartbeat:IPaddr \
        params ip="10.11.10.14" nic="bond2" cidr_netmask="21" \
        op monitor interval="5s"
primitive p_lu_data01_vol01 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2013-10.com.xdomain.storage1001.data01" lun="1" path="/dev/vg_data01/lv_vol01" implementation="lio" \
        op monitor interval="10"
primitive p_lu_data02_vol01 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2013-10.com.xdomain.storage1001.data02" lun="1" path="/dev/vg_data02/lv_vol01" implementation="lio" \
        op monitor interval="10"
primitive p_lvm01 ocf:heartbeat:LVM \
        params volgrpname="vg_data01" \
        op monitor interval="4" timeout="8"
primitive p_lvm02 ocf:heartbeat:LVM \
        params volgrpname="vg_data02" \
        op monitor interval="4" timeout="8"
primitive p_ping ocf:pacemaker:ping \
        op monitor interval="5s" timeout="15s" \
        params host_list="10.11.10.1" multiplier="200" name="p_ping"
primitive p_target lsb:target \
        op monitor interval="30" timeout="30"
primitive p_target_data01 ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2013-10.com.xdomain.storage1001.data01" implementation="lio" \
        op monitor interval="10s" timeout="20s"
primitive p_target_data02 ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2013-10.com.xdomain.storage1001.data02" implementation="lio" \
        op monitor interval="10s" timeout="20s"
primitive st_node_a stonith:external/synaccess \
        params synaccessip="reboot11.xdomain.com" community="*******" port="Storage1001A" pcmk_host_list="Storage1001A.xdomain.com" \
        meta target-role="Started"
primitive st_node_b stonith:external/synaccess \
        params synaccessip="reboot10.xdomain.com" community="******" port="Storage1001B" pcmk_host_list="Storage1001B.xdomain.com" \
        meta target-role="Stopped"
group rg_data01 p_lvm01 p_ip1 p_ip11 p_target_data01 p_lu_data01_vol01 p_email_admin1 \
        meta target-role="Started"
group rg_data02 p_lvm02 p_ip2 p_ip22 p_target_data02 p_lu_data02_vol01 p_email_admin2 \
        meta target-role="Stopped"
ms ms_drbd1 p_drbd1 \
        meta notify="true" master-max="1" clone-max="2" clone-node-max="1" target-role="Started" \
        meta resource-stickiness="101"
ms ms_drbd2 p_drbd2 \
        meta notify="true" master-max="1" clone-max="2" clone-node-max="1" target-role="Started" \
        meta resource-stickiness="101"
clone c_lsb_target p_target \
        meta target-role="Started"
clone c_ping p_ping \
        meta globally-unique="false" target-role="Started"
location data01_prefer_a ms_drbd1 \
        rule $id="data01_prefer_a_rule" $role="Master" 100: #uname eq Storage1001A.xdomain.com
location data02_prefer_b ms_drbd2 \
        rule $id="data02_prefer_b_rule" $role="Master" 100: #uname eq Storage1001B.xdomain.com
location st_node_a-loc st_node_a \
        rule $id="st_node_a-loc-id" -inf: #uname eq Storage1001A.xdomain.com
location st_node_b-loc st_node_b \
        rule $id="st_node_b-loc-id" -inf: #uname eq Storage1001B.xdomain.com
colocation c_drbd1 inf: rg_data01 ms_drbd1:Master
colocation c_drbd2 inf: rg_data02 ms_drbd2:Master
order o_data01_start inf: ms_drbd1:promote rg_data01:start
order o_data01_stop inf: rg_data01:stop ms_drbd1:demote
order o_data02_start inf: ms_drbd2:promote rg_data02:start
order o_data02_stop inf: rg_data02:stop ms_drbd2:demote
order o_target_before_data01 inf: c_lsb_target:start rg_data01
order o_target_before_data02 inf: c_lsb_target:start rg_data02
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        no-quorum-policy="ignore" \
        default-resource-stickiness="1" \
        last-lrm-refresh="1387064240"

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org