On Wed, May 19, 2010 at 8:51 AM, Andrew Beekhof <[email protected]> wrote:

> On Tue, May 18, 2010 at 2:05 PM, mike <[email protected]> wrote:
> > So now that I have a few clusters up and running after a few problems
> > I've started looking at the logs with some regularity. I'm hoping
> > someone can confirm my thoughts on some entries in the ha-log.
> >
> > 1. PEngine Recheck Timer (I_PE_CALC) just popped!
> > Is this entry related to the next one?
>
> yes
>
> > 2. info: do_state_transition: Starting PEngine Recheck Timer
> > This particular entry appears every 10 minutes or so and I'll see it 3
> > or 4 times in 1 or 2 minutes and then it will go away for 10 minutes.
> > Then the cycle repeats.
> >
> > 3. info: native_merge_weights: mysql: Rolling back scores from ip_mysql.
> > Now this entry I like. It appears to roll back the fail count (I think)
>
> No.  Well not directly.
> This happens when A optionally depends on B and factoring in B's
> allocation preferences would mean A can't run anywhere.
>
> > which is what my DBA was looking for. He wants mysql to failover if
> > there are 3 successive failures of MySQL but only if those successive
> > failures occur within 15 minutes.
>
> You want migration-threshold=3 and failure-timeout=900000 (15 * 60 * 1000)
>
>
Isn't failure-timeout defined in seconds? Or is it milliseconds?
Also, among the fields available for a resource's monitor operation there are:

- interval, default 0
Does this mean there is no monitoring at all unless I specify a non-zero
value?
Is it this setting that disables monitoring by default, or is it the
"enabled" field (see below) being set to false? Or both?

- enabled: I suppose the default is true if I add the monitor line to the
resource but don't give this field a value. Is that correct?

I also found this description of the two parameters migration-threshold
and failure-timeout:

Moving Resources Due to Failure
New in 1.0 is the concept of a migration threshold. Simply define
migration-threshold=N for a resource and it will migrate to a new node after
N failures. There is no threshold defined by default. To determine the
resource's current failure status and limits, use crm_mon --failcounts.
By default, once the threshold has been reached, the node will no longer be
allowed to run the failed resource until the administrator manually resets
the resource's failcount using crm_failcount (after hopefully first fixing
the failure's cause). However it is possible to expire them by setting the
resource's failure-timeout option.

So, following your comments, suppose migration-threshold is set to 3 and
failure-timeout is set to 15 minutes. Is the behavior below what I should
expect?

time 0: R1 starts
time 5 min: R1 fails ---> the failure-timeout counter starts and the cluster
successfully restarts it (in place? Where is it configured whether the restart
is attempted in place or on the other node?); failcount=1
time 10 min: R1 fails again ---> the failure-timeout counter resets and the
cluster successfully restarts it; failcount=2
time 25 min: R1 has not failed again ---> failcount is reset to 0 and we are
in the same condition as at time 0
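
The timeline above can be sketched as a small Python simulation. This is only
a model of the behavior being asked about, not confirmed Pacemaker behavior:
it assumes the failure-timeout counter restarts on every failure and that the
failcount drops to 0 once that much time passes with no new failure. The
function and constant names are hypothetical, and times are in minutes to
sidestep the seconds-versus-milliseconds question (15 minutes is 900 seconds,
or 900000 milliseconds).

```python
# Model of the scenario in this message. The expiry rule is an ASSUMPTION
# about how migration-threshold and failure-timeout interact, not a
# description of what the cluster actually does.

MIGRATION_THRESHOLD = 3    # failures after which the resource must move
FAILURE_TIMEOUT_MIN = 15   # quiet minutes after which the failcount expires


def simulate(failure_times_min, check_times_min):
    """Return (time, failcount, must_migrate) for every event, in time order.

    failure_times_min: minutes at which the resource fails
    check_times_min:   minutes at which the state is only inspected
    """
    events = sorted(
        [(t, "fail") for t in failure_times_min]
        + [(t, "check") for t in check_times_min]
    )
    failcount = 0
    last_failure = None
    timeline = []
    for t, kind in events:
        # Assumed rule: the counter restarts at each failure, and the
        # failcount expires once the timeout elapses since the last one.
        if last_failure is not None and t - last_failure >= FAILURE_TIMEOUT_MIN:
            failcount = 0
            last_failure = None
        if kind == "fail":
            failcount += 1
            last_failure = t
        timeline.append((t, failcount, failcount >= MIGRATION_THRESHOLD))
    return timeline


if __name__ == "__main__":
    # Failures at 5 and 10 minutes, then a look at the state at 25 minutes.
    for t, count, migrate in simulate([5, 10], [25]):
        print(f"time {t} min: failcount={count} migrate={migrate}")
```

Under these assumptions, failures at 5 and 10 minutes followed by a check at
25 minutes give failcount 1, 2, and then 0, matching the timeline above. A
third failure at, say, 12 minutes would instead reach the threshold of 3.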

Thanks
Gianluca
