On Thu, May 03, 2007 at 11:00:48AM -0400, Doug Knight wrote:
> On Thu, 2007-05-03 at 16:12 +0200, Dejan Muhamedagic wrote:
> 
> > On Thu, May 03, 2007 at 09:08:12AM -0400, Doug Knight wrote:
> > > Thanks Dejan, I'll try the kill -9. One thing I'm seeing is that I can
> > > easily move the resources between nodes using the <location> constraint,
> > > but if I shut down heartbeat on one node (/etc/init.d/heartbeat stop) I
> > > run into problems. If I shut down the node with the active resources,
> > > heartbeat migrates the DRBD Master to the other node but the colocated
> > > group does not migrate (it remains stopped on the active node). I'm
> > 
> > That's no good. You should send logs/config.
> > 
> 
> I've attached cibadmin -Q and ha.cf. See below on question about logs.
> I'm going to try flipping the nodes around and repeating the shutdown of
> the "non-active" node.
> 
> 
> > > digging into that now. If I shut down the node that does not have the
> > > active resources, the following happens:
> > > 
> > > (State: DC on active node1, running drbd master and group resources)
> > > shutdown node2
> > > demote attempted on node1 for drbd master,
> > 
> > Why demote? It's master running on a good node.
> > 
> 
> Don't know, this is what I observed. I wondered why it would do a demote
> when this node is already OK.
> 
> > > no attempt at halting groups
> > > resources that depend on drbd
> > 
> > Why should the resources be stopped? You shut down a node which
> > doesn't have any resources.
> > 
> 
> Same here, don't know why taking down the node without resources would
> affect the other. One thing I keep coming back to is the <locate>
> constraint, and how it affects processing. Probably not an issue, but...
> 
> 
> > I'm not well versed in the master/slave business, so I can't comment
> > more. But something seems to be very broken: either your config
> > or you ran into a bug.
> > 
> 
> Just completed some more testing, and it gets more interesting (and
> probably supports your thought that a config might be broken somewhere):

Your config looks ok to me, though I can't vouch for the
constraints. Did you check http://www.linux-ha.org/CIB/Idioms?
The constraints shown there are a bit more extensive than yours.

> Node1 - DC, active resources, <locate> constraint to this node
> - executed heartbeat shutdown on node1 - all resources migrated, as well
> as DC, to node2
> - executed heartbeat startup on node1 - all resources migrate back to
> node1, DC stays on node2
> - executed heartbeat shutdown again on node1 - all resources migrated to
> node2
> - executed heartbeat startup again on node1 - all resources migrated
> back to node1, DC stays on node2
> 
> However:
> Node2 - DC, active resources, <locate> constraint to this node
> - executed heartbeat shutdown on node2 - all resources stopped, shutdown
> takes about 14 minutes to complete, all resources migrated to node1
> - executed heartbeat startup on node2 - all resources stop on node1,
> file system resource within group resource flashes FAILED on crm_mon,
> drbd Master migrates back to node1, group resources stay in stopped
> state
> 
> So, there is definitely something different going on between the two
> nodes. I'll attach the cibadmin -Q and ha.cf files (minimal diff, just
> the IP addresses). Can you suggest an "optimal" way of determining how
> much or what part of the ha.debug logs to capture? I am attaching the
> log from node2 where the shutdown took ~14 minutes.

This log is quite small. There's a huge delay of about 20 minutes;
that's when a timer pops. Two actions on drbd did not finish, but
I can't see which.

Regarding "how much logs": the more the better. Grab everything
and use compression.
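
For the "use compression" part, here is a minimal sketch in Python. The
function name compress_log and the example path are only illustrative
assumptions, not anything heartbeat ships. Point it at whatever the
logfile/debugfile directives in your ha.cf refer to.

```python
import gzip
import shutil
from pathlib import Path


def compress_log(src, dst=None):
    """Gzip-compress the file at src and return the path of the compressed copy."""
    src = Path(src)
    # Default output name: the source name with ".gz" appended.
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so a large log is never held in memory at once.
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    import sys

    # Example (path is an assumption): python compress_log.py /var/log/ha-debug
    for name in sys.argv[1:]:
        print(compress_log(name))
```

The original file is left in place, so the compressed copy can be
attached without disturbing a running node.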

> > > demote of drbd master fails due to "device held open" error, filesystem
> > > still has it mounted
> > > loops through continuously trying to demote drbd (spin condition)
> > > shutdown command never completes, control-C, then kill -9 main heartbeat
> > > on node1
> > > drbd:0 goes stopped, :1 Master goes FAILED, group resources all still
> > > show started
> > > startup command executed on node1, Bad Things Happen, eventually drbd
> > > goes unmanaged
> > > after node1 heartbeat startup completes, stop group and drbd, restart
> > > resources, everything comes up fine
> > > 
> > > I'm going to try a similar test, but using kill -9 right off the bat
> > > instead of the controlled shutdown. If there's any info I need to
> > > provide to make this clearer, please, anybody, just let me know.
> > > 
> > > Doug
> > > 
> > > On Thu, 2007-05-03 at 13:14 +0200, Dejan Muhamedagic wrote:
> > > 
> > > > On Fri, Apr 27, 2007 at 03:10:22PM -0400, Doug Knight wrote:
> > > > > I now have a working configuration with DRBD master/slave, and a
> > > > > filesystem/pgsql/ipaddr group following it around. So far, I've been
> > > > > > using a Place constraint and modifying its uname value to test the
> > > > > > "fail over" of the resources. Can someone suggest a reasonable set of
> > > > > > tests that most do to verify other possible error conditions (short
> > > > > > of pulling the plug on one of the servers)?
> > > > 
> > > > You can run CTS with your configuration. Otherwise, stopping
> > > > heartbeat in a way that it doesn't notice being stopped (kill -9)
> > > > simulates the "pull power plug" condition. You'd also want to
> > > > make various resources fail.
> > > > 
> > > > > Also, the Place constraint is on the
> > > > > > DRBD master/slave, does that make sense or should it be placed on one
> > > > > > of the "higher level" resources like the file system or pgsql?
> > > > 
> > > > I don't think it matters, you can go with either, given that the
> > > > resources are collocated.
> > > > 
> > > > > Thanks,
> > > > > Doug
> > > > > 
> > > > > On Thu, 2007-04-26 at 09:45 -0400, Doug Knight wrote:
> > > > > 
> > > > > > Hi Alastair,
> > > > > > > Have you encountered a situation where when you first start up the
> > > > > > > drbd master/slave resource, crm_mon and/or the GUI indicate Master
> > > > > > > status on one node, and Started status on the other (as opposed to
> > > > > > > Slave)? If so, how did you correct it?
> > > > > > 
> > > > > > Doug
> > > > > > p.s. Thanks for the scripts and xml, they're a big help!
> > > > > > 
> > > 
> > > 
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> > 


-- 
Dejan