On Tue, 10 Aug 2010, Igor Chudov wrote:

> Guys, I just sent ha-log, ha.cf, haresources from both machines.
>
> At this point, I of course greatly appreciate your help and your
> generous assistance.
>
> But I wonder if our attention is going in a wrong direction of "try
> this and try that".
>
> What if right now, I need to systematically understand what exactly is
> happening between them, how they decide who takes over, and why
> exactly none of them decides to take over.
>
> Assuming this is my question, I want to know what I should explore to
> understand what is happening (as opposed to trying more of same).
>
> Does this Heartbeat have a debug option beyond what I have already used?

not that I am aware of. I am not currently running the version you are, and in the older version I am used to looking at, there is information in ha-log about each resource as it starts.

> What needs to happen for one to take over?

a box needs to be 'unhealthy' for the other box to take over.

when both boxes start up at the same time, the one listed in haresources will take the resources.

with 'autofailback on' the cluster will always try to migrate the resources to the system listed in haresources if it thinks it's healthy.

> What is not present out of what is needed?

the ha-log files you just sent show heartbeat shutting down, not starting. we need the logs of the startup to see what's happening.

> Why is the primary Heartbeat not taking over back when the secondary
> is obviously not providing resources?

that depends on exactly what is causing the primary to not start the resources.

the haresources files you listed still say drbddisk and Dimitry identified that the correct thing was drbd. did you rename the script? if you did, you may need to edit the script so that the usage data that it reports matches.

what I think is happening is that the primary starts, tries to bring up the resources, gets an error and releases them.

at this point I think that one of two things is happening:

1. since auto-failback is on, it tries again
2. the secondary tries to bring up the resources, gets an error and releases them.

if shutting the primary down lets the backup work, I would suspect #1.

if shutting the backup down lets the primary work, then I am puzzled.

I try real hard not to use autofailback yes; it makes troubleshooting a flaky box hard because as soon as it comes up it becomes active. I've found that to cause additional outages (plus even if the box is fixed, it causes unneeded failovers).
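
to make the takeover rules above concrete, here is a rough toy model in Python. it only encodes the rules as described in this message (an unhealthy box loses the resources, the box listed in haresources wins a simultaneous start, and autofailback pulls resources back to that box whenever it looks healthy). the names Node, owner, preferred and backup are made up for illustration; this is not Heartbeat code, and it does not model a resource script failing on start.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True  # does the cluster consider this box alive?


def owner(preferred, backup, current, auto_failback):
    """Return the name of the box that should hold the resources, or None.

    preferred: the box listed in haresources (hypothetical model)
    backup:    the other box
    current:   name of the box holding the resources now, or None at startup
    """
    nodes = {preferred.name: preferred, backup.name: backup}

    # With autofailback on, a healthy preferred box always gets the resources.
    if auto_failback and preferred.healthy:
        return preferred.name

    # Otherwise the current holder keeps them for as long as it is healthy.
    if current is not None and nodes[current].healthy:
        return current

    # Nobody healthy holds them: the preferred box goes first, then the backup.
    for node in (preferred, backup):
        if node.healthy:
            return node.name
    return None
```

for example, owner(Node("a"), Node("b"), current="b", auto_failback=False) leaves the resources on "b" after "a" has been repaired, while the same call with auto_failback=True moves them back to "a" right away. that immediate move is the behaviour that makes a flaky primary hard to troubleshoot.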

I always make my primary and secondary boxes identical, so there is no performance reason for failing back (except in the _very_ rare cases where I run one service on the primary and a different one on the backup, with each being failover for the other).

I have had many cases where one bad resource entry would prevent things from starting. with the older version I am using I get a better error message than the one I saw you post, which lets me find the bad entry more easily.

David Lang
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
