Hello,

Is there a mechanism to retry starting failed resources?

Here is my situation:

+-----------+
| DB server |
+-----------+
      |
+-----------+
|   Router  |
+-----------+
      |
    [VIP]
+---+   +---+
| A |===| B |
+---+   +---+

I have a firewall authentication cluster with nodes A and B.
I have custom resource agents (RAs) running on the cluster; the monitor
action of each RA indirectly makes connections to the DB server.

If Router fails, the monitor action concludes that the resource has
failed (which it hasn't) and the cluster fails the resource over to the
other node (which does no good, because the problem is with Router).
The resource then fails on the other node, for the same reason it
failed on the first one.

Another way to look at the above problem (this actually happened to the
cluster I work with):

Suppose both nodes (A,B) are alive and can run resource X.

A -> running X
B -> backup

Router fails

A -> X has failed on this node and the cluster will not attempt to run
the resource here until cleared with crm_resource.
B -> running X

Router is still in failure state

A -> X has failed on this node and the cluster will not attempt to run
the resource here until cleared with crm_resource.
B -> X has failed on this node and the cluster will not attempt to run
the resource here until cleared with crm_resource.

Router failure is repaired. However, the cluster has gone from a
working, operational cluster to a broken one, and it stays broken.
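
To make the sequence above concrete, here is a rough Python sketch of
it. It is only a toy model of the behaviour I observed, not the cluster
manager's actual logic; the function name simulate and its arguments
are made up for illustration.

```python
from typing import List, Optional, Tuple


def simulate(nodes: List[str],
             router_up_per_step: List[bool]) -> List[Tuple[Optional[str], List[str]]]:
    """Toy model: a monitor failure marks the active node as failed for good."""
    failed = set()          # nodes where the resource is marked failed
    active: Optional[str] = nodes[0]
    history = []
    for router_up in router_up_per_step:
        if active is not None and not router_up:
            # The monitor cannot reach the DB, so it reports a failure
            # even though the resource itself is healthy.
            failed.add(active)
            candidates = [n for n in nodes if n not in failed]
            active = candidates[0] if candidates else None
        # Note: nothing ever removes a node from `failed` on its own,
        # which is the behaviour I would like to change.
        history.append((active, sorted(failed)))
    return history


if __name__ == "__main__":
    for step in simulate(["A", "B"], [True, False, False, True]):
        print(step)
```

The last step has Router back up, yet no node is running X.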

So, I have a couple of questions regarding this:

1) Is there a way to have on_fail do different things based on the
number of runnable nodes available for a resource?

Let n be the number of nodes that can run a resource (say X).

If n > 1 then on_fail = 'restart'.
If n == 1 then on_fail = 'block'.

This way, if the problem that caused the failure is outside of the
cluster, the cluster can still be running the resources once that
outside problem is fixed.
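
Expressed as code, the rule I have in mind would look roughly like the
sketch below. The function choose_on_fail is hypothetical; as far as I
know nothing like it exists in the cluster software, and the strings
are just the on_fail values mentioned above.

```python
def choose_on_fail(runnable_nodes: int) -> str:
    """Pick an on_fail value from the number of nodes that can run the resource."""
    if runnable_nodes < 1:
        raise ValueError("resource has no runnable nodes")
    # More than one candidate: a restart (and possible failover) is safe.
    # Last candidate: block, so the resource is not marked failed everywhere.
    return "restart" if runnable_nodes > 1 else "block"


if __name__ == "__main__":
    for n in (2, 1):
        print(n, choose_on_fail(n))
```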

2) Is there some way to test whether a problem has cleared?

I know that I can run crm_resource --cleanup to have a node try to run
the resource again. Is there an automated equivalent for the cib.xml?

Like:
<op id="retry_id" interval="10s" name="retry_failed_RA" timeout="20s"/>
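
In other words, something that behaves like the following sketch. All
names here are hypothetical: dependency_ok stands for whatever check
tells us the outside problem (Router, DB server) is gone, and cleanup
stands for whatever clears the failure, for example a wrapper around
crm_resource --cleanup.

```python
from typing import Callable, Iterable, List, Tuple


def retry_failed(
    failed: Iterable[Tuple[str, str]],
    dependency_ok: Callable[[], bool],
    cleanup: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Clear failed (resource, node) pairs once the outside dependency is healthy."""
    if not dependency_ok():
        # The outside problem persists; clearing now would only fail again.
        return []
    cleared = []
    for resource, node in failed:
        cleanup(resource, node)
        cleared.append((resource, node))
    return cleared


if __name__ == "__main__":
    # Stand-in callables; a real setup would probe the DB and call the CLI.
    done = retry_failed(
        [("X", "A"), ("X", "B")],
        dependency_ok=lambda: True,
        cleanup=lambda rsc, node: print("would clean up", rsc, "on", node),
    )
    print(done)
```

Run every 10s or so, this would be the automated equivalent I am asking
about.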

TIA,

-- 
Matt Zagrabelny - [EMAIL PROTECTED] - (218) 726 8844
University of Minnesota Duluth
Information Technology Systems & Services
PGP key 1024D/84E22DA2 2005-11-07
Fingerprint: 78F9 18B3 EF58 56F5 FC85  C5CA 53E7 887F 84E2 2DA2

He is not a fool who gives up what he cannot keep to gain what he cannot
lose.
-Jim Elliot

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems