On Wed, Sep 3, 2008 at 03:27, Matt Zagrabelny <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Is there a mechanism to retry to start failed resources?
>
> Here is my situation:
>
> +-----------+
> | DB server |
> +-----------+
>      |
> +-----------+
> |   Router  |
> +-----------+
>      |
>    [VIP]
> +---+   +---+
> | A |===| B |
> +---+   +---+
>
> I have a firewall authentication cluster with nodes A and B.
> I have custom resource agents running on the cluster, the monitor action
> of the RAs indirectly makes connections to the DB server.
>
> If Router fails, then the monitor action concludes that the
> resource has failed (which it hasn't) and the cluster fails the resource
> over to the other node (which won't do any good, because the problem was
> with Router). It then fails on the other node (for the same reasons it
> failed on the first one).
>
> Another way to look at the above problem (this actually happened to the
> cluster I work with)
>
> Suppose both nodes (A,B) are alive and can run resource X.
>
> A -> running X
> B -> backup
>
> Router fails
>
> A -> X has failed on this node and the cluster will not attempt to run
> the resource here until cleared with crm_resource.
> B -> running X
>
> Router is still in failure state
>
> A -> X has failed on this node and the cluster will not attempt to run
> the resource here until cleared with crm_resource.
> B -> X has failed on this node and the cluster will not attempt to run
> the resource here until cleared with crm_resource.
>
> Router failure is repaired. However, the cluster has gone from a
> working, operational cluster to a broken one.
>
> So, I have a couple questions regarding this:
>
> 1) Is there a way to have on_fail do different things based on the
> number of runnable nodes available for a resource?

no

>
> Let n be the number of nodes that can run a resource (say X).
>
> If n > 1 then on_fail = 'restart'.
> If n == 1 then on_fail = 'block'.
>
> This way, if the problem that caused the failure lies outside of the
> cluster and gets resolved, the cluster can still be running the
> resources once the outside problem is fixed.
>
> 2) Is there some way to test to see if a problem has cleared?
>
> I know that I can run crm_resource --cleanup to have a node try to run
> the resource again. Is there an automated equivalent for the cib.xml?
>
> Like:
> <op id="retry_id" interval="10s" name="retry_failed_RA" timeout="20s"/>

Not in Pacemaker 0.6, but 0.7 adds the ability to expire failures.

see "Migration due to failure" in:
  http://clusterlabs.org/mw/Image:Configuration_Explained_1.0.pdf
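For illustration, failure expiry in Pacemaker 0.7+ is set per resource via meta attributes rather than an op like the one sketched above. A rough example (the resource id "X", agent, and values here are just placeholders, not from the thread):

```xml
<primitive id="X" class="ocf" provider="heartbeat" type="Dummy">
  <meta_attributes id="X-meta_attributes">
    <!-- fail over after 2 failures on a node -->
    <nvpair id="X-migration-threshold" name="migration-threshold" value="2"/>
    <!-- forget recorded failures after 60s, so the node becomes
         eligible to run X again without a manual crm_resource --cleanup -->
    <nvpair id="X-failure-timeout" name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
```

With failure-timeout set, a failure caused by the external router should age out on its own once the router is repaired, which addresses question 2.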
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems