On Thu, May 20, 2010 at 3:30 PM, mike <[email protected]> wrote:
> Gianluca Cecchi wrote:
>> On Thu, May 20, 2010 at 2:45 PM, mike <[email protected]> wrote:
>>
>>> ok, I actually went ahead and did a test on my cluster. The results did
>>> not occur as I would have expected.
>>>
>>> I failed ldirectord twice on the main node. I waited 20 minutes and saw
>>> this entry in the log file:
>>> May 20 08:23:10 lvsuat1a.intranet.mydomain.com pengine: [6589]: notice:
>>> get_failcount: Failcount for ldirectord on
>>> lvsuat1a.intranet.mydomain.com has expired (limit was 900s)
>>>
>>> So now I kill ldirectord again, fully expecting it to restart on the
>>> same node but instead a failover occurs:
>>> May 20 08:36:15 lvsuat1a.intranet.mydomain.com pengine: [6589]: WARN:
>>> common_apply_stickiness: Forcing ldirectord away from
>>> lvsuat1a.intranet.mydomain.com after 3 failures (max=3)
>>>
>> So your version of pacemaker should be a 1.0.x one.
>> In fact Andrew wrote that the reset is not automatic for that version, while
>> it should be for upcoming 1.1
>>
>> Gianluca
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
>
> Yes, he said that In 1.0 it becomes ignored after the specified
> interval. I wasn't sure what he meant by that. I thought perhaps he
> meant it would ignore future failures and not fail over.

No, sorry. In 1.0 you have to clear out the fail-counts manually.
Yes, it's not ideal.
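
To make the sequence in the quoted logs concrete, here is a rough toy model
in Python. It is only a sketch of the behaviour described in this thread,
not Pacemaker's actual code: the class, its field names and the timestamps
are made up for illustration, and only the 900s limit and max=3 come from
the log lines above. The assumption is that an expired fail-count is ignored
but stays stored, so the next failure counts on top of the old ones unless
the count was cleared by hand first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailCountModel:
    """Toy model of one resource's fail-count on one node (hypothetical)."""
    migration_threshold: int              # "max=3" in the quoted log
    failure_timeout: float                # "limit was 900s" in the quoted log
    count: int = 0
    last_failure: Optional[float] = None  # time of the most recent failure

    def record_failure(self, now: float) -> None:
        # Assumption: the stored count keeps growing; expiry never resets it.
        self.count += 1
        self.last_failure = now

    def effective_count(self, now: float) -> int:
        # An expired count is ignored (treated as 0) but stays stored.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            return 0
        return self.count

    def forced_away(self, now: float) -> bool:
        return self.effective_count(now) >= self.migration_threshold

    def clear(self) -> None:
        # Stands in for clearing the fail-count manually.
        self.count = 0
        self.last_failure = None


def simulate(clear_before_third_failure: bool) -> bool:
    """Return True if the third failure forces the resource off the node."""
    fc = FailCountModel(migration_threshold=3, failure_timeout=900)
    fc.record_failure(now=0)
    fc.record_failure(now=60)
    later = 60 + 20 * 60                  # wait 20 minutes: count has expired
    assert fc.effective_count(later) == 0
    if clear_before_third_failure:
        fc.clear()
    fc.record_failure(now=later + 1)      # kill the resource once more
    return fc.forced_away(now=later + 1)


if __name__ == "__main__":
    print("no manual clear :", simulate(False))
    print("manual clear    :", simulate(True))
```

Without the manual clear the third failure lands on a stored count of 2 and
the resource is forced away, which matches what mike saw. With the clear it
is only failure number 1 and the resource would stay put.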
