On Wed, Sep 3, 2008 at 03:27, Matt Zagrabelny <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Is there a mechanism to retry starting failed resources?
>
> Here is my situation:
>
>   +-----------+
>   | DB server |
>   +-----------+
>         |
>   +-----------+
>   |  Router   |
>   +-----------+
>         |
>       [VIP]
>   +---+   +---+
>   | A |===| B |
>   +---+   +---+
>
> I have a firewall authentication cluster with nodes A and B.
> I have custom resource agents running on the cluster; the monitor action
> of the RAs indirectly makes connections to the DB server.
>
> If Router fails, then the monitor action believes that the resource has
> failed (which it hasn't) and fails the resource over to the other node
> (which won't do any good, because the problem was with Router). It then
> fails on the other node, for the same reason it failed on the first one.
>
> Another way to look at the above problem (this actually happened to the
> cluster I work with):
>
> Suppose both nodes (A, B) are alive and can run resource X.
>
> A -> running X
> B -> backup
>
> Router fails.
>
> A -> X has failed on this node and the cluster will not attempt to run
>      the resource here until cleared with crm_resource.
> B -> running X
>
> Router is still in a failure state.
>
> A -> X has failed on this node and the cluster will not attempt to run
>      the resource here until cleared with crm_resource.
> B -> X has failed on this node and the cluster will not attempt to run
>      the resource here until cleared with crm_resource.
>
> The Router failure is repaired. However, the cluster went from a working,
> operational cluster to a broken one.
>
> So, I have a couple of questions regarding this:
>
> 1) Is there a way to have on_fail do different things based on the
> number of runnable nodes available for a resource?
no

> Let n be the number of nodes that can run a resource (say X).
>
> If n > 1 then on_fail = 'restart'.
> If n == 1 then on_fail = 'block'.
>
> This way, if the failure was caused by a problem outside of the
> cluster, the cluster can still be running the resources when the
> outside problem is fixed.
>
> 2) Is there some way to test whether a problem has cleared?
>
> I know that I can run crm_resource --cleanup to have a node try to run
> the resource again. Is there an automated equivalent for the cib.xml?
>
> Something like:
> <op id="retry_id" interval="10s" name="retry_failed_RA" timeout="20s"/>

Not in Pacemaker 0.6, but in 0.7 there is the ability to expire failures.

See "Migration due to failure" in:
http://clusterlabs.org/mw/Image:Configuration_Explained_1.0.pdf

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
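[Editor's sketch] The failure-expiry feature referenced above is configured through resource meta attributes in Pacemaker 0.7/1.0-style CIB XML. A hedged example, assuming the attribute names from the linked "Configuration Explained" document (migration-threshold and failure-timeout); the resource ids and values here are purely illustrative:

```xml
<primitive id="X" class="ocf" provider="heartbeat" type="Dummy">
  <meta_attributes id="X-meta">
    <!-- move away after 1 monitor failure, as in the scenario above -->
    <nvpair id="X-meta-threshold" name="migration-threshold" value="1"/>
    <!-- forget the failure after 10 minutes, so the node becomes
         eligible again once the external (Router) problem is fixed,
         without a manual crm_resource --cleanup -->
    <nvpair id="X-meta-timeout" name="failure-timeout" value="10min"/>
  </meta_attributes>
  <operations>
    <op id="X-monitor" name="monitor" interval="10s" timeout="20s"/>
  </operations>
</primitive>
```

With failure-timeout set, the scenario in the original mail no longer ends in a permanently broken cluster: once the timeout elapses after the Router is repaired, the recorded failures expire and the cluster retries X on A or B on its own.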
