Can you open a bug for this and include the _complete_ logs, as well as
which version you're running (I no longer recall)?

On 5/4/07, Doug Knight <[EMAIL PROTECTED]> wrote:
It seems the two nodes in my cluster are behaving differently from each
other. First, some simplification/mapping for node names to compare to
the attached logs:

node1 - arc-tkincaidlx
node2 - arc-dknightlx

References to the resource group cover the Filesystem, pgsql, and
IPaddr resources, which are colocated and ordered.

Heartbeat shutdowns and restarts on node1, regardless of whether it is
DC, has active resources, etc, all perform as expected. If the resources
are on node1, they migrate successfully to node2. If the location
constraint sets the resources to node1, and node1 re-enters the cluster,
all resources migrate back. It's when ANY heartbeat stop, start, or restart
occurs on node2 that things break. For instance:

node1 is DC, master rsc_drbd_7788:1, group active
node2 is slave rsc_drbd_7788:0 ONLY
/etc/init.d/heartbeat stop is executed on node2
node1 tries to execute a demote on rsc_drbd_7788:1
demote fails because group is active on node1, Filesystem is holding the
drbd device open via mount point
heartbeat continues to loop trying to demote on node1, about 9 times a
second
heartbeat on node2, where stop was executed, loops calling
notify/pre/demote on rsc_drbd_7788:0, about once a second
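To put a number on the looping described above, the attempts can be counted per second from the logs. This is only a rough sketch: it assumes each log line carries an HH:MM:SS timestamp and mentions the operation name as plain text, and the sample lines are made up for illustration, not taken from the attached logs. Note that a plain keyword match on "demote" also counts notify/pre/demote lines.

```python
import re
from collections import Counter

# Assumption: every relevant log line contains an HH:MM:SS timestamp.
TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2})")

def ops_per_second(lines, keyword="demote"):
    """Count lines mentioning `keyword`, grouped by their HH:MM:SS timestamp."""
    counts = Counter()
    for line in lines:
        if keyword not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical log lines, for illustration only.
sample = [
    "14:03:32 demote rsc_drbd_7788:1",
    "14:03:32 demote rsc_drbd_7788:1",
    "14:03:33 monitor rsc_drbd_7788:1",
    "14:03:33 demote rsc_drbd_7788:1",
]

for second, count in sorted(ops_per_second(sample).items()):
    print(second, count)
```

Running the same function over each node's log separately would show whether the two loops really run at different rates.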

It takes a manual kill of heartbeat to get things back in order, and in
the meantime drbd goes split brain, or so it seems from what I have to do
to manually get drbd connected again. So, the problem is that heartbeat
thinks it needs to demote the master rsc_drbd_7788:1 resource, and even
if this were correct, it doesn't handle the group resources that are
dependent on it and ordered/colocated with it. The attached logs cover
the entire sequence of events during the shutdown of heartbeat on node2.
Times of significance to help in looking at the logs are:

Node2 HB shutdown started at 14:03:31
Manually started killing HB on node2 at 14:05:33
Node2 completed HB shutdown at 14:06:03
Node2 Timer pop at 14:06:33
Node1 HB shutdown to try to alleviate looping at 14:07:51
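
For anyone lining these times up against the logs, the offsets from the start of the node2 shutdown can be worked out with a short script. This is a minimal sketch that assumes all five timestamps fall on the same day; the labels are shortened from the list above.

```python
from datetime import datetime

# Timestamps copied from the list above (assumed same day, 24-hour clock).
events = [
    ("Node2 HB shutdown started", "14:03:31"),
    ("Manual kill of HB on node2 started", "14:05:33"),
    ("Node2 completed HB shutdown", "14:06:03"),
    ("Node2 timer pop", "14:06:33"),
    ("Node1 HB shutdown to stop looping", "14:07:51"),
]

def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for _, stamp in events]
    start = times[0]
    return [
        (label, int((t - start).total_seconds()))
        for (label, _), t in zip(events, times)
    ]

for label, secs in offsets(events):
    print(f"+{secs:4d}s  {label}")
```

By this count the manual kill began 122 seconds into the shutdown, and the node1 shutdown came 260 seconds in.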

The logs are kind of large due to the looping (I deleted most of the
looping, so if more info is needed I can provide the complete logs), and
I've zipped them up, so if this email exceeds the list's size limits I
respectfully ask the moderator to allow it to go through.

Doug Knight
WSI, Inc.


> > > > digging into that now. If I shut down the node that does not have the
> > > > active resources, the following happens:
> > > >
> > > > (State: DC on active node1, running drbd master and group resources)
> > > > shutdown node2
> > > > demote attempted on node1 for drbd master,
> > >
> > > Why demote? It's master running on a good node.
> > >
> >
> > Don't know, this is what I observed. I wondered why it would do a demote
> > when this node is already OK.
> >
> > > > no attempt at halting groups
> > > > resources that depend on drbd
> > >
> > > Why should the resources be stopped? You shut down a node which
> > > doesn't have any resources.
> > >
> >

truncated...

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

