On Wed, May 19, 2010 at 8:51 AM, Andrew Beekhof <[email protected]> wrote:
> On Tue, May 18, 2010 at 2:05 PM, mike <[email protected]> wrote:
> > So now that I have a few clusters up and running after a few problems
> > I've started looking at the logs with some regularity. I'm hoping
> > someone can confirm my thoughts on some entries in the ha-log.
> >
> > 1. PEngine Recheck Timer (I_PE_CALC) just popped!
> > Is this entry related to the next one?
>
> yes
>
> > 2. info: do_state_transition: Starting PEngine Recheck Timer
> > This particular entry appears every 10 minutes or so and I'll see it 3
> > or 4 times in 1 or 2 minutes and then it will go away for 10 minutes.
> > Then the cycle repeats.
> >
> > 3. info: native_merge_weights: mysql: Rolling back scores from ip_mysql.
> > Now this entry I like. It appears to roll back the fail count (I think)
>
> No. Well not directly.
> This happens when A optionally depends on B and factoring in B's
> allocation preferences would mean A can't run anywhere.
>
> > which is what my DBA was looking for. He wants mysql to failover if
> > there are 3 successive failures of MySQL but only if those successive
> > failures occur within 15 minutes.
>
> You want migration-threshold=3 and failure-timeout=900000 (15 * 60 * 1000)

Isn't failure-timeout defined in seconds? Or is it milliseconds?

Also, among the fields available for a resource's monitor operation there are:

- interval, default 0
  Does this mean there is no monitoring at all if I don't specify a value
  other than zero? Is it this field that disables monitoring by default, or
  is it the "enabled" field (see below) being set to false? Or both?

- enabled, default true
  I suppose this applies if I put the monitor line inside the resource but
  don't give this field a value, correct?

I also found this description of the two parameters migration-threshold
and failure-timeout:

Moving Resources Due to Failure

New in 1.0 is the concept of a migration threshold. Simply define
migration-threshold=N for a resource and it will migrate to a new node
after N failures.
There is no threshold defined by default. To determine the resource's
current failure status and limits, use crm_mon --failcounts

By default, once the threshold has been reached, node will no longer be
allowed to run the failed resource until the administrator manually resets
the resource's failcount using crm_failcount (after hopefully first fixing
the failure's cause). However it is possible to expire them by setting the
resource's failure-timeout option.

So, after your comments, suppose migration-threshold is set to 3 and
failure-timeout is set to 15 minutes. Is the following the expected behavior?

time 0       R1 starts
time 5 min   R1 fails ---> the failure-timeout counter begins and the cluster
             successfully restarts it; failcount=1
             (In place? Where is it set whether the restart is attempted
             in place or on the other node?)
time 10 min  R1 fails again ---> the failure-timeout counter resets and the
             cluster successfully restarts it; failcount=2
time 25 min  R1 has not failed again ---> failcount is reset to 0 and we are
             in a similar condition as at time 0

Thanks
Gianluca
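
To make the timeline above easier to check, here is a small Python sketch of the behavior I am expecting. It is only a model of my reading of the documentation, not of how Pacemaker actually implements it: the function name simulate_failcount is made up for illustration, all times are in minutes, and it assumes that each failure restarts the failure-timeout window and that the fail count drops to 0 once the window passes with no new failure. It does not model when the cluster actually re-evaluates expiry.

```python
def simulate_failcount(failure_times, check_times,
                       migration_threshold=3, failure_timeout=15):
    """Return {check_time: (failcount, threshold_reached)}.

    Hypothetical model (an assumption, not Pacemaker code): every failure
    restarts the failure-timeout window, and the count is cleared once the
    window elapses without a new failure. Times are in minutes.
    """
    results = {}
    failures = sorted(failure_times)
    for now in sorted(check_times):
        failcount = 0
        last_failure = None
        for t in failures:
            if t > now:
                break
            # Clear the count if the window had elapsed before this failure.
            if last_failure is not None and t - last_failure >= failure_timeout:
                failcount = 0
            failcount += 1
            last_failure = t
        # Clear the count if the window has elapsed by the time we look.
        if last_failure is not None and now - last_failure >= failure_timeout:
            failcount = 0
        results[now] = (failcount, failcount >= migration_threshold)
    return results


# The scenario from the timeline: failures at 5 and 10 minutes.
timeline = simulate_failcount(failure_times=[5, 10],
                              check_times=[0, 5, 10, 24, 25])
for minute, (count, reached) in timeline.items():
    print(f"t={minute:>2} min  failcount={count}  threshold reached={reached}")
```

If the real behavior differs from this model (for example in when the count expires), that difference is exactly what I am asking about.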
