Hello list,

I am new to the HA world and I am trying to set up an active/passive
cluster, unfortunately with Heartbeat 2.1.4 for historical reasons.

My resource stack is a resource group with the collocated and ordered
flags set to true. The group contains the following components:

- DRBD Disk
- VIP
- slapd
- postgresql
- application
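
For reference, a group of this shape in the Heartbeat 2.1.x CIB looks
roughly like the sketch below (the resource ids, classes and agent names
are illustrative, not my exact configuration):

```xml
<group id="group_1" ordered="true" collocated="true">
  <!-- ids and resource agent names below are illustrative -->
  <primitive id="drbddisk_1" class="heartbeat" type="drbddisk"/>
  <primitive id="vip_1" class="ocf" provider="heartbeat" type="IPaddr2"/>
  <primitive id="slapd_1" class="lsb" type="slapd"/>
  <primitive id="postgresql_1" class="lsb" type="postgresql"/>
  <primitive id="application_1" class="lsb" type="application"/>
</group>
```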

The scenario I am trying to implement is: on a service failure, the
cluster should try to restart the resource locally, and after 5
failures move the whole group to the other node.

To do that, I set resource_failure_stickiness to 100 on the group and
created the following location rule:

<constraints>
   <rsc_location id="location_failure" rsc="group_1">
         <rule id="prefered_location_failure" score="-500"/>
   </rsc_location>
</constraints>
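
The stickiness itself is set as an attribute on the group; in my CIB it
looks roughly like this (the nvpair ids are illustrative). Note that the
documentation usually gives resource_failure_stickiness a negative
value, so that each failure lowers the node's score:

```xml
<group id="group_1" ordered="true" collocated="true">
  <meta_attributes id="group_1_meta_attrs">
    <attributes>
      <!-- value as described above; combined with the per-node
           fail-count, this decides when the group migrates -->
      <nvpair id="group_1_failure_stickiness"
              name="resource_failure_stickiness" value="100"/>
    </attributes>
  </meta_attributes>
</group>
```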

After that, I started the cluster and simulated 5 failures on node 0.
After these 5 failures, the cluster switched over to node 1, which
seems to be the expected behaviour.

Then I simulated 5 more failures on node 1. After these failures, node
1 stopped the resources and there was no switchover back to node 0.
That seems normal, since the fail-count is still at 5 on both node 0
and node 1.

Now I want to restart the resources on node 0, so I cleared the
fail-count for the resource, but the cluster does not want to start
them anywhere anymore. The same thing happens if I do it on node 1.
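
For reference, I cleared the fail-count along these lines (a sketch;
the resource id drbddisk_1 and the node names are examples, and the
commands run against the live cluster):

```shell
# Query the current fail-count for one resource on one node
crm_failcount -G -r drbddisk_1 -U node0

# Reset the fail-count on both nodes (repeated for each resource)
crm_failcount -D -r drbddisk_1 -U node0
crm_failcount -D -r drbddisk_1 -U node1
```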

crm_verify -LV shows me that the resources cannot run anywhere, and if
I run ptest in verbose mode, I see the following message:

ptest[10784]: 2012/09/18_15:39:37 debug: native_assign_node: All nodes
for resource drbddisk_1 are unavailable, unclean or shutting down
(node1: 1, -1000000).
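
For completeness, the exact commands were roughly as follows (the
verbosity levels are approximate; both read the live CIB):

```shell
# Check the live configuration for errors
crm_verify -L -V

# Ask the policy engine what it would do, with increasing verbosity
ptest -L -VV
ptest -L -VVVV
```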

At a more verbose level, I see that both nodes seem to have a
'-INFINITY' score:
ptest[11819]: 2012/09/18_15:40:53 debug: debug2: node_list_update:
node1: -1000000 + 0
ptest[11819]: 2012/09/18_15:40:53 debug: debug2: node_list_update:
node0: -1000000 + 0

Restarting Heartbeat did not bring the cluster back to a nominal state.
Hence my questions for this ML:

- Is there a known way to bring my cluster back to a nominal state?
- Did I correctly understand the resource_failure_stickiness concept?
- Could this be a bug in version 2.1.4 that would force me to move to
Heartbeat 3 + Pacemaker?

Thanks in advance for your answers.

Regards,
Olivier BONHOMME




_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
