The two nodes in my cluster seem to behave differently from each other. First, a mapping of node names, for comparison against the attached logs:
node1 = arc-tkincaidlx
node2 = arc-dknightlx

References to "the resource group" mean the Filesystem, pgsql, and IPaddr resources, which are colocated and ordered.

Heartbeat shutdowns and restarts on node1 all perform as expected, regardless of whether node1 is DC, has active resources, etc. If the resources are on node1, they migrate successfully to node2. If the location constraint prefers node1 and node1 re-enters the cluster, all resources migrate back.

It's when ANY heartbeat stop, start, or restart occurs on node2 that things break. For instance:

- node1 is DC, master of rsc_drbd_7788:1, group active
- node2 is slave, rsc_drbd_7788:0
- ONLY "/etc/init.d/heartbeat stop" is executed on node2
- node1 tries to execute a demote on rsc_drbd_7788:1
- the demote fails because the group is active on node1; Filesystem is holding the drbd device open via its mount point
- heartbeat keeps looping, trying the demote on node1 about 9 times a second
- heartbeat on node2, where the stop was executed, loops calling notify/pre/demote on rsc_drbd_7788:0 about once a second

It takes a manual kill of heartbeat to get things back in order, and in the meantime drbd goes split-brain, or so it seems from what I have to do to manually get drbd connected again.

So, the problem is that heartbeat thinks it needs to demote the master resource rsc_drbd_7788:1, and even if that were correct, it doesn't handle the group resources that depend on it and are ordered/colocated with it. The attached logs cover the entire sequence of events during the shutdown of heartbeat on node2.
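For reference, a hedged sketch of what the constraint section of the CIB might look like for the layout described above, in heartbeat 2.x CRM XML. The resource ids (rsc_drbd_7788, the master/slave id ms_drbd_7788, group_1) and all scores are assumptions for illustration; exact attribute names vary between heartbeat 2.x releases, so treat this as a shape, not a drop-in config:

```xml
<constraints>
  <!-- group may only run where the drbd master is -->
  <rsc_colocation id="col_group_with_drbd_master"
                  from="group_1" to="ms_drbd_7788"
                  to_role="master" score="INFINITY"/>
  <!-- promote drbd before starting the group; the inverse implies
       the group must be stopped before the master can be demoted -->
  <rsc_order id="ord_promote_before_group"
             from="group_1" action="start"
             to="ms_drbd_7788" to_action="promote" type="after"/>
  <!-- location preference: keep the group on node1 -->
  <rsc_location id="loc_group_node1" rsc="group_1">
    <rule id="loc_group_node1_rule" score="100">
      <expression id="loc_group_node1_expr" attribute="#uname"
                  operation="eq" value="arc-tkincaidlx"/>
    </rule>
  </rsc_location>
</constraints>
```

The point of the order constraint is exactly the behavior that appears to be missing: a demote of rsc_drbd_7788:1 should first trigger a stop of the dependent group, which would release the mount point.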
Times of significance to help in reading the logs:

- 14:03:31  node2 HB shutdown started
- 14:05:33  manually started killing HB on node2
- 14:06:03  node2 completed HB shutdown
- 14:06:33  node2 timer pop
- 14:07:51  node1 HB shutdown, to try to alleviate the looping

The logs are fairly large due to the looping (I deleted most of the looped entries, so if more info is needed I can provide the complete logs), and I've zipped them up; if this email exceeds the list's size limits I respectfully ask the moderator to allow it through.

Doug Knight
WSI, Inc.

> > > > digging into that now. If I shutdown the node that does not have the
> > > > active resources, the following happens:
> > > >
> > > > (State: DC on active node1, running drbd master and group resources)
> > > > shutdown node2
> > > > demote attempted on node1 for drbd master,
> > >
> > > Why demote? It's master running on a good node.
> > >
> > Don't know, this is what I observed. I wondered why it would do a demote
> > when this node is already OK.
> >
> > > > no attempt at halting group
> > > > resources that depend on drbd
> > >
> > > Why should the resources be stopped? You shutdown a node which
> > > doesn't have any resources.
> >
truncated...
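For anyone digging through the attached logs, a small helper to cut a time window out of them can save scrolling. This is a sketch under assumptions: the helper name (extract_window) is mine, and it assumes syslog-style lines where the third whitespace-separated field is a zero-padded HH:MM:SS timestamp; adjust the field index if the ha.debug format differs:

```shell
#!/bin/sh
# extract_window START END FILE
# Prints log lines whose timestamp (field 3, HH:MM:SS) falls in [START, END].
# Zero-padded HH:MM:SS compares correctly as a plain string.
extract_window() {
    start="$1"; end="$2"; file="$3"
    awk -v s="$start" -v e="$end" '{
        t = $3                       # third field is the HH:MM:SS timestamp
        if (t >= s && t <= e) print
    }' "$file"
}

# Example (commented out; filename is one of the attachments, unzipped):
#   extract_window "14:03:31" "14:06:03" ha.debug.keep.node2
```

Run once per node log with the windows listed above, e.g. 14:03:31 to 14:06:03 to isolate the node2 shutdown, or 14:06:33 onward for the timer pop.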
Attachments:
- cibadmin.xml.gz (GNU Zip compressed data)
- ha.debug.keep.node2.gz (GNU Zip compressed data)
- ha.debug.keep.node1.small.gz (GNU Zip compressed data)
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
