I'm attempting to configure an NFS cluster, and I've observed that under some failure conditions, resources that depend on a failed resource simply stop: no migration to another node is attempted, even though a manual migration demonstrates that the other node can run all of the resources, and the resources remain on the good node even after the migration constraint is removed.

I was able to reduce the configuration to this:

node storage01
node storage02
primitive drbd_nfsexports ocf:pacemaker:Stateful
primitive fs_test ocf:pacemaker:Dummy
primitive vg_nfsexports ocf:pacemaker:Dummy
group test fs_test
ms drbd_nfsexports_ms drbd_nfsexports \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" target-role="Started"
location l fs_test -inf: storage02
colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) ( drbd_nfsexports_ms:Master )
property $id="cib-bootstrap-options" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        last-lrm-refresh="1339793579"

The location constraint "l" exists only to demonstrate the problem; I added it to simulate the NFS server being unrunnable on one node.

To see the issue I'm experiencing:

1. Put storage01 in standby to force everything onto storage02. fs_test will not be able to run (because of the location constraint "l").
2. Bring storage01 back out of standby. Even though storage01 can satisfy all the constraints, no migration takes place.
3. Put storage02 in standby, and everything migrates to storage01 and starts successfully.
4. Take storage02 out of standby, and the services remain on storage01.

This demonstrates that even though there is a clear "best" placement in which all resources can run, Pacemaker isn't finding it.
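In case it helps, the exact sequence I'm running is roughly this (crm shell syntax; node names as in the configuration above):

```sh
# Force everything onto storage02; fs_test cannot run there because of "l"
crm node standby storage01

# Bring storage01 back; I expected a migration here, but nothing moves
crm node online storage01
crm_mon -1

# Put storage02 in standby: everything migrates to storage01 and starts
crm node standby storage02

# Take storage02 out of standby: services stay on storage01
crm node online storage02
```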

So far, I've found that any one of the following changes "fixes" the problem:

- removing colo_drbd_master
- removing any one resource from colo_drbd_master
- eliminating the group "test" and referencing fs_test directly in constraints
- using a simple clone instead of a master/slave pair for drbd_nfsexports_ms

My current understanding is that if there exists a way to run all resources, Pacemaker should find it and prefer it. Is that not the case? Perhaps I need to restructure my colocation constraint somehow? This is, of course, a much-reduced version of a more complex practical configuration, so I'm trying to understand the underlying mechanisms rather than just the solution to this particular scenario.
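For example, one restructuring I've considered (untested, and the constraint names here are just placeholders) is replacing the single set-based colocation with a pairwise chain, so each dependency is stated explicitly:

```
colocation colo_vg_on_master inf: vg_nfsexports drbd_nfsexports_ms:Master
colocation colo_test_with_vg inf: test vg_nfsexports
```

I don't know whether the set-based form and the pairwise form are supposed to be equivalent here, which is part of what I'm hoping to understand.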

In particular, I'm not sure how to inspect what Pacemaker is thinking when it places resources. I've tried running crm_simulate -LRs, but I'm a little unclear on how to interpret the results. In the output, I do see this:

drbd_nfsexports:1 promotion score on storage02: 10
drbd_nfsexports:0 promotion score on storage01: 5

Those numbers seem to account for the default stickiness of 1 for master/slave resources, but they don't seem to incorporate the colocation constraints at all. Is that expected?
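For reference, the command I've been using to pull out the scores is along these lines (the grep is just my own filtering of the --show-scores output):

```sh
# Show allocation and promotion scores computed from the live CIB
crm_simulate -L -s | grep -E 'promotion score|native_color'
```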


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
