On Tue, 10 Aug 2010, Igor Chudov wrote:

> Guys, I just sent ha-log, ha.cf, haresources from both machines.
>
> At this point, I of course greatly appreciate your help and your
> generous assistance.
>
> But I wonder if our attention is going in a wrong direction of "try
> this and try that".
>
> What if right now, I need to systematically understand what exactly is
> happening between them, how they decide who takes over, and why
> exactly none of them decides to take over.
>
> Assuming this is my question, I want to know what I should explore to
> understand what is happening (as opposed to trying more of same).
>
> Does this Heartbeat have a debug option beyond what I have already used?

not that I am aware of. I am not currently running the version you are, and in 
the older version I am used to looking at there is information in ha-log about 
each resource as it starts.

> What needs to happen for one to take over?

a box needs to be 'unhealthy' for the other box to take over

when both boxes start up at the same time, the one listed in haresources will 
take the resources

with 'auto_failback on' in ha.cf, the cluster will always try to migrate the 
resources to the system listed in haresources if it thinks it's healthy.
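
To make those three rules concrete, here is a toy model of them in Python. It is only a sketch of the behaviour as described above, not Heartbeat's actual decision code; the function name and arguments are made up for illustration, and "healthy" is reduced to a simple true/false per node.

```python
def resource_owner(preferred, nodes_healthy, current_owner=None, auto_failback=True):
    """Return the node that should hold the resources, or None.

    preferred      -- node named in haresources (hypothetical name)
    nodes_healthy  -- dict mapping node name -> True/False
    current_owner  -- node holding the resources now (None at startup)
    auto_failback  -- models the 'auto_failback on' setting
    """
    healthy = [node for node, ok in nodes_healthy.items() if ok]
    if not healthy:
        return None
    # At startup, or with failback enabled, a healthy preferred node wins.
    if nodes_healthy.get(preferred) and (current_owner is None or auto_failback):
        return preferred
    # Otherwise the current owner keeps the resources while it stays healthy.
    if current_owner is not None and nodes_healthy.get(current_owner):
        return current_owner
    # The current owner is unhealthy (or there is none): a healthy peer takes over.
    return healthy[0]
```

With both nodes healthy and the backup currently active, this returns the primary when auto_failback is true and the backup when it is false, which is the difference that matters when troubleshooting.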

> What is not present out of what is needed?

the ha-log files you just sent show heartbeat shutting down, not starting. we 
need the logs of the startup to see what's happening.

> Why is the primary Heartbeat not taking over back when the secondary
> is obviously not providing resources?

that depends on exactly what is causing the primary to not start the resources.

the haresources files you listed still say drbddisk, and Dimitry identified that 
the correct thing was drbd. did you rename the script? if you did, you may need 
to edit the script so that the usage data that it reports matches.

what I think is happening is that the primary starts, tries to bring up the 
resources, gets an error and releases them.

at this point I think that one of two things is happening

1. since auto_failback is on, the primary tries again

2. the secondary tries to bring up the resources, gets an error and releases 
them.


if shutting the primary down lets the backup work, I would suspect #1

if shutting the backup down lets the primary work, then I am puzzled.


I try real hard not to use 'auto_failback on'; it makes troubleshooting a flaky 
box hard because as soon as it comes up it becomes active. I've found that to 
cause additional outages (plus, even if the box is fixed, it causes unneeded 
failovers). I always make my primary and secondary boxes identical, so there is 
no performance reason for failing back (except in the _very_ rare cases where I 
run one service on the primary and a different one on the backup, with each 
being failover for the other).




I have had many cases where one bad resource entry would prevent things from 
starting. with the older version I am using I get a better error message than 
the one I saw you post, which let me find the bad entry more easily.
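
One way to hunt for a bad entry without relying on the log messages is to check that every resource named in haresources has a matching script. The Python below is a rough sketch under some assumptions: the function name is made up, it assumes the usual "node resource::arg ..." line format with backslash continuations, and it treats a bare IP address as shorthand for IPaddr, which is my understanding of the format. It only checks that the script file exists, not that the script works.

```python
import os

def missing_resource_scripts(haresources_text, script_dirs):
    """Return resource script names from haresources not found in script_dirs."""
    missing = []
    # Join lines that were continued with a trailing backslash.
    text = haresources_text.replace("\\\n", " ")
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        for entry in fields[1:]:  # fields[0] is the preferred node name
            name = entry.split("::", 1)[0]  # strip '::'-separated arguments
            if name[0].isdigit():
                name = "IPaddr"  # assumed: bare IP address means IPaddr
            found = any(os.path.isfile(os.path.join(d, name)) for d in script_dirs)
            if not found:
                missing.append(name)
    return missing
```

On a typical install you would pass the contents of /etc/ha.d/haresources and the directories /etc/ha.d/resource.d and /etc/init.d, but check where your version actually looks for scripts.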


David Lang
