After following the tutorial on the Hastexo site for setting up STONITH via libvirt, I believe I have it working correctly... but some strange things are happening. I have two nodes, l2 and l3, with shared storage provided by a dual-primary DRBD resource and OCFS2. Here is one of my STONITH primitives:

primitive p_fence-l2 stonith:external/libvirt \
        params hostlist="l2:l2.sandbox" hypervisor_uri="qemu+ssh://matt@hv01/system" stonith-timeout="30" pcmk_host_check="none" \
        op start interval="0" timeout="15" \
        op stop interval="0" timeout="15" \
        op monitor interval="60" \
        meta target-role="Started"
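
(For completeness: p_fence-l3 is the mirror image of the above, roughly as follows, with the hostlist swapped to match l3's names. I'm quoting it from memory, so treat the exact values as illustrative:)

primitive p_fence-l3 stonith:external/libvirt \
        params hostlist="l3:l3.sandbox" hypervisor_uri="qemu+ssh://matt@hv01/system" stonith-timeout="30" pcmk_host_check="none" \
        op start interval="0" timeout="15" \
        op stop interval="0" timeout="15" \
        op monitor interval="60" \
        meta target-role="Started"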

This cluster has stonith-enabled="true" in the cluster options, plus the necessary location constraints in the CIB, keeping each fence device off the node it is meant to fence.
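
(Those constraints follow the usual pattern of pinning each fence device away from its own target; the constraint names here are illustrative:)

location l_fence-l2 p_fence-l2 -inf: l2
location l_fence-l3 p_fence-l3 -inf: l3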

To watch the DLM, I run dbench on the shared storage on the node I let live, and while it's running I creatively nuke the other node. If I just "killall pacemakerd" on l2, for instance, the DLM seems unaffected: the fence takes place, rebooting the now-"failed" node l2, with no real interruption of service on the surviving node, l3. Yet if I "halt -f -n" on l2, the fence still takes place, but the surviving node's (l3's) DLM hangs and won't come back until I bring the failed node back online. Note that l2 and l3 can be interchanged; the results are the same. Also note that while the DLM is hung, as in the latter case, kernel messages about hung tasks eventually start populating the syslog.
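
In case it helps to see the exact steps, the test sequence is roughly this (the mount point is a placeholder, and dlm_tool is just what I use to peek at the lockspaces):

# on the survivor (l3): generate DLM traffic on the OCFS2 mount
dbench -D /mnt/ocfs2 4

# failure 1: kill Pacemaker on l2 -- l2 gets fenced, l3's DLM is unaffected
killall pacemakerd        # run on l2

# failure 2: drop l2 hard, no sync and no clean shutdown -- l2 still gets
# fenced, but l3's DLM hangs until l2 comes back online
halt -f -n                # run on l2

# on l3, inspect lockspace state while it hangs
dlm_tool ls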

I thought I had recently read some posts concerning this very topic, but for the life of me I can't find them...
Any ideas on how I should proceed, or what I should look for next?

Thanks!
-- Matt




