On 05/21/2012 02:26 PM, Florian Haas wrote:
> On Mon, May 21, 2012 at 8:14 PM, Matthew O'Connor <m...@ecsorl.com> wrote:
>> On 05/21/2012 05:43 AM, Florian Haas wrote:
>>> Does it have "fencing resource-and-stonith" in the DRBD configuration,
>>> and stonith_admin-fence-peer.sh as its fence-peer handler?
>> That was the problem. Totally forgot to update my DRBD configuration.
> I actually wasn't saying that that was the root cause of your problem.
> :) But it's worth looking into, anyhow.
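
For readers following along, the two settings Florian asks about would sit
in a DRBD resource definition roughly as sketched below. This is only an
illustration: the resource name r0 is hypothetical, the /usr/lib/drbd/
handler path is an assumption about where the distribution installs DRBD's
helper scripts, and the rest of the resource (devices, addresses, net
options) is left out.

```
resource r0 {
  disk {
    # on loss of the peer, suspend I/O and invoke the fence-peer handler
    fencing resource-and-stonith;
  }
  handlers {
    # assumed install path; adjust to wherever the script actually lives
    fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
  }
  # remaining sections (net, per-host "on" blocks) omitted
}
```
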
Ah - well, for sake of barking up the right tree, here is a snippet of
the logs of l2 after l3 was halted, and before making any changes to the
DRBD configuration:

May 19 23:00:13 l2 stonith-ng: [1554]: info: initiate_remote_stonith_op: Initiating remote operation reboot for l3: b1374d19-458b-4520-9cbf-e2e5812e6639
May 19 23:00:13 l2 stonith-ng: [1554]: info: can_fence_host_with_device: p_fence-l3 can fence l3: none
May 19 23:00:13 l2 stonith-ng: [1554]: info: call_remote_stonith: Requesting that l2 perform op reboot l3
May 19 23:00:13 l2 stonith-ng: [1554]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="b1374d19-458b-4520-9cbf-e2e5812e6639" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_target="l3" st_device_action="reboot" st_timeout="54000" src="l2" seq="10" />
May 19 23:00:13 l2 stonith-ng: [1554]: info: can_fence_host_with_device: p_fence-l3 can fence l3: none
May 19 23:00:13 l2 stonith-ng: [1554]: info: stonith_fence: Found 1 matching devices for 'l3'
May 19 23:00:13 l2 stonith-ng: [1554]: info: stonith_command: Processed st_fence from l2: rc=-1
May 19 23:00:13 l2 stonith-ng: [1554]: info: make_args: reboot-ing node 'l3' as 'port=l3'
May 19 23:00:14 l2 stonith-ng: [1554]: info: stonith_command: Processed st_execute from lrmd: rc=-1
May 19 23:00:19 l2 stonith-ng: [1554]: info: log_operation: Operation 'reboot' [7042] (call 0 from (null)) for host 'l3' with device 'p_fence-l3' returned: 0
May 19 23:00:19 l2 stonith-ng: [1554]: info: log_operation: p_fence-l3: Performing: stonith -t external/libvirt -T reset l3
May 19 23:00:19 l2 stonith-ng: [1554]: info: log_operation: p_fence-l3: success: l3 0
May 19 23:00:19 l2 stonith-ng: [1554]: info: process_remote_stonith_exec: ExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="11" />
May 19 23:00:19 l2 stonith-ng: [1554]: info: remote_op_done: Notifing clients of b1374d19-458b-4520-9cbf-e2e5812e6639 (reboot of l3 from 9f36c78b-06c8-4b62-bc84-6cb87b30351b by l2): 2, rc=0
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback: StonithOp <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="reboot" st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="11" state="2" st_target="l3" />
May 19 23:00:19 l2 stonith-ng: [1554]: info: stonith_notify_client: Sending st_fence-notification to client 1559/b09a62f6-b077-4181-98da-91f43f40bc9a
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback: StonithOp <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="reboot" st_remote_op="b1374d19-458b-4520-9cbf-e2e5812e6639" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="11" state="2" st_target="l3" />
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback: Stonith operation 4/82:118:0:b92bcccd-5765-469c-b56e-392cc065b65c: OK (0)
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_callback: Stonith of l3 passed
May 19 23:00:19 l2 crmd: [1559]: info: send_stonith_update: Sending fencing update 358 for l3
May 19 23:00:19 l2 stonith-ng: [1554]: info: stonith_notify_client: Sending st_fence-notification to client 1559/b09a62f6-b077-4181-98da-91f43f40bc9a
May 19 23:00:19 l2 crmd: [1559]: info: tengine_stonith_notify: Peer l3 was terminated (reboot) by l2 for l2 (ref=b1374d19-458b-4520-9cbf-e2e5812e6639): OK
May 19 23:00:19 l2 crmd: [1559]: notice: tengine_stonith_notify: Notified CMAN that 'l3' is now fenced
May 19 23:00:19 l2 crmd: [1559]: notice: tengine_stonith_notify: Confirmed CMAN fencing event for 'l3'

AND here is a log snippet from after the DRBD configuration was updated.

May 21 14:36:02 l2 stonith-ng: [1618]: info: initiate_remote_stonith_op: Initiating remote operation reboot for l3: 9c19ba05-363c-48b4-ade3-d9dac5087866
May 21 14:36:02 l2 stonith-ng: [1618]: info: can_fence_host_with_device: p_fence-l3 can fence l3: none
May 21 14:36:02 l2 stonith-ng: [1618]: info: call_remote_stonith: Requesting that l2 perform op reboot l3
May 21 14:36:02 l2 stonith-ng: [1618]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="9c19ba05-363c-48b4-ade3-d9dac5087866" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="9c19ba05-363c-48b4-ade3-d9dac5087866" st_target="l3" st_device_action="reboot" st_timeout="54000" src="l2" seq="20" />
May 21 14:36:02 l2 stonith-ng: [1618]: info: can_fence_host_with_device: p_fence-l3 can fence l3: none
May 21 14:36:02 l2 stonith-ng: [1618]: info: stonith_fence: Found 1 matching devices for 'l3'
May 21 14:36:02 l2 stonith-ng: [1618]: info: stonith_command: Processed st_fence from l2: rc=-1
May 21 14:36:02 l2 stonith-ng: [1618]: info: make_args: reboot-ing node 'l3' as 'port=l3'
May 21 14:36:08 l2 stonith-ng: [1618]: info: log_operation: Operation 'reboot' [341] (call 0 from (null)) for host 'l3' with device 'p_fence-l3' returned: 0
May 21 14:36:08 l2 stonith-ng: [1618]: info: log_operation: p_fence-l3: Performing: stonith -t external/libvirt -T reset l3
May 21 14:36:08 l2 stonith-ng: [1618]: info: log_operation: p_fence-l3: success: l3 0
May 21 14:36:08 l2 stonith-ng: [1618]: info: process_remote_stonith_exec: ExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="9c19ba05-363c-48b4-ade3-d9dac5087866" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="21" />
May 21 14:36:08 l2 stonith-ng: [1618]: info: remote_op_done: Notifing clients of 9c19ba05-363c-48b4-ade3-d9dac5087866 (reboot of l3 from f782c9f8-71e1-4ec2-8f45-93a4b2f7f795 by l2): 2, rc=0
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_callback: StonithOp <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="reboot" st_remote_op="9c19ba05-363c-48b4-ade3-d9dac5087866" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/libvirt -T reset l3#012success: l3 0#012" src="l2" seq="21" state="2" st_target="l3" />
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_callback: Stonith operation 5/81:56:0:e647e4db-cb29-4db4-a0bc-b631fc35f5ec: OK (0)
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_callback: Stonith of l3 passed
May 21 14:36:08 l2 crmd: [1623]: info: send_stonith_update: Sending fencing update 276 for l3
May 21 14:36:08 l2 stonith-ng: [1618]: info: stonith_notify_client: Sending st_fence-notification to client 1623/ffe204e9-3d5d-4a11-b605-084d3f61980d
May 21 14:36:08 l2 crmd: [1623]: info: tengine_stonith_notify: Peer l3 was terminated (reboot) by l2 for l2 (ref=9c19ba05-363c-48b4-ade3-d9dac5087866): OK
May 21 14:36:08 l2 stonith-ng: [1618]: info: stonith_device_execute: Nothing to do for p_fence-l3
May 21 14:36:08 l2 crmd: [1623]: notice: tengine_stonith_notify: Notified CMAN that 'l3' is now fenced
May 21 14:36:08 l2 crmd: [1623]: notice: tengine_stonith_notify: Confirmed CMAN fencing event for 'l3'

I am not sure this reveals much, but chances are you will see something
I don't! ;-)

>> For sake of testing, I used the "crm-fence-peer.sh" script - it seemed
>> to do the trick, although I strongly suspect this is the wrong script
>> for the job.
> It is. No good for dual-Primary, really, as it doesn't prevent split
> brain in that sort of configuration.

Yes, that is perfectly sensible. Perhaps my (still-in-testing)
production cluster's problem will be a bit simpler, then?
The DRBD resource there is actually operated in single-primary mode on a
two-node cluster, because it is served up over iSCSI to another cluster
of machines. DLM/OCFS2 do not operate on the DRBD/iSCSI host cluster,
only on the iSCSI client cluster. So, in this case, would the
crm-fence-peer.sh then be sufficient for the DRBD cluster nodes?

>
>> Do I need to write my own script to call stonith_admin?
> No, stonith_admin-fence-peer.sh ships with recent DRBD releases.

Sadness...not found on Ubuntu 12.04. They are providing v8.3.11. I will
check with them...

Thanks!!

--
Matthew

>
> Cheers,
> Florian
>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org