On Thu, May 03, 2007 at 11:00:48AM -0400, Doug Knight wrote:
> On Thu, 2007-05-03 at 16:12 +0200, Dejan Muhamedagic wrote:
>
> > On Thu, May 03, 2007 at 09:08:12AM -0400, Doug Knight wrote:
> > > Thanks Dejan, I'll try the kill -9. One thing I'm seeing is that I
> > > can easily move the resources between nodes using the <location>
> > > constraint, but if I shut down heartbeat on one node
> > > (/etc/init.d/heartbeat stop) I run into problems. If I shut down
> > > the node with the active resources, heartbeat migrates the DRBD
> > > Master to the other node but the colocated group does not migrate
> > > (it remains stopped on the active node). I'm
> >
> > That's no good. You should send logs/config.
>
> I've attached cibadmin -Q and ha.cf. See below on question about logs.
> I'm going to try flipping the nodes around and repeating the shutdown
> of the "non-active" node.
>
> > > digging into that now. If I shut down the node that does not have
> > > the active resources, the following happens:
> > >
> > > (State: DC on active node1, running drbd master and group resources)
> > > shutdown node2
> > > demote attempted on node1 for drbd master,
> >
> > Why demote? It's master running on a good node.
>
> Don't know, this is what I observed. I wondered why it would do a
> demote when this node is already OK.
>
> > > no attempt at halting group resources that depend on drbd
> >
> > Why should the resources be stopped? You shut down a node which
> > doesn't have any resources.
>
> Same here, don't know why taking down the node without resources would
> affect the other. One thing I keep coming back to is the <locate>
> constraint, and how it affects processing. Probably not an issue,
> but...
>
> > I'm not versatile in the master/slave business, so I can't comment
> > more. But something seems to be very broken: either your config
> > or you ran into a bug.
>
> Just completed some more testing, and it gets more interesting (and
> probably supports your thought that a config might be broken
> somewhere):
Your config looks ok to me, though I can't vouch for the constraints.
Did you check http://www.linux-ha.org/CIB/Idioms? Over there they are
a bit more extensive than what you have.

> Node1 - DC, active resources, <locate> constraint to this node
> - executed heartbeat shutdown on node1 - all resources migrated, as
>   well as DC, to node2
> - executed heartbeat startup on node1 - all resources migrate back to
>   node1, DC stays on node2
> - executed heartbeat shutdown again on node1 - all resources migrated
>   to node2
> - executed heartbeat startup again on node1 - all resources migrated
>   back to node1, DC stays on node2
>
> However:
> Node2 - DC, active resources, <locate> constraint to this node
> - executed heartbeat shutdown on node2 - all resources stopped,
>   shutdown takes about 14 minutes to complete, all resources migrated
>   to node1
> - executed heartbeat startup on node2 - all resources stop on node1,
>   file system resource within group resource flashes FAILED on
>   crm_mon, drbd Master migrates back to node1, group resources stay
>   in stopped state
>
> So, there is definitely something different going on between the two
> nodes. I'll attach the cibadmin -Q and ha.cf files (minimal diff, just
> the IP addresses); can you suggest an "optimal" way of determining how
> much or what part of the ha.debug logs to capture? I am attaching the
> log from node2 where the shutdown took ~14 minutes.

This log is quite small. There's a huge delay of about 20 minutes;
that's when we have a timer popping. Two actions on drbd did not
finish, but I can't see which.

Regarding "how much logs": the more the better. Grab everything and
use compression.
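[The advice above -- grab everything and use compression -- can be
scripted so it is done the same way on both nodes. A minimal sketch;
the function name is mine, and ha-log/ha-debug under /var/log are
assumptions based on typical ha.cf defaults, so adjust to your
logfile/debugfile settings:]

```shell
#!/bin/sh
# capture_ha_logs DIR OUT: compress whichever of the usual heartbeat
# log files exist under DIR into the tarball OUT.
capture_ha_logs() {
    dir=$1; out=$2
    # keep only the log files that actually exist
    files=$(cd "$dir" && ls ha-log ha-debug 2>/dev/null)
    if [ -z "$files" ]; then
        echo "no heartbeat logs found in $dir" >&2
        return 1
    fi
    tar czf "$out" -C "$dir" $files && echo "wrote $out"
}

# On a real node you would run something like:
#   capture_ha_logs /var/log /tmp/ha-logs-node2.tar.gz
```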
> > > demote of drbd master fails due to "device held open" error,
> > > filesystem still has it mounted
> > > loops through continuously trying to demote drbd (spin condition)
> > > shutdown command never completes, control-C, then kill -9 main
> > > heartbeat on node1
> > > drbd:0 goes stopped, :1 Master goes FAILED, group resources all
> > > still show started
> > > startup command executed on node1, Bad Things Happen, eventually
> > > drbd goes unmanaged
> > > after node1 heartbeat startup completes, stop group and drbd,
> > > restart resources, everything comes up fine
> > >
> > > I'm going to try a similar test, but using kill -9 right off the
> > > bat instead of the controlled shutdown. If there's any info I need
> > > to provide to make this clearer, please, anybody, just let me know.
> > >
> > > Doug
> > >
> > > On Thu, 2007-05-03 at 13:14 +0200, Dejan Muhamedagic wrote:
> > >
> > > > On Fri, Apr 27, 2007 at 03:10:22PM -0400, Doug Knight wrote:
> > > > > I now have a working configuration with DRBD master/slave, and
> > > > > a filesystem/pgsql/ipaddr group following it around. So far,
> > > > > I've been using a Place constraint and modifying its uname
> > > > > value to test the "fail over" of the resources. Can someone
> > > > > suggest a reasonable set of tests that most people do to verify
> > > > > other possible error conditions (short of pulling the plug on
> > > > > one of the servers)?
> > > >
> > > > You can run CTS with your configuration. Otherwise, stopping
> > > > heartbeat in a way that it doesn't notice being stopped (kill -9)
> > > > simulates the "pull power plug" condition. You'd also want to
> > > > make various resources fail.
> > > >
> > > > > Also, the Place constraint is on the DRBD master/slave, does
> > > > > that make sense or should it be placed on one of the "higher
> > > > > level" resources like the file system or pgsql?
> > > >
> > > > I don't think it matters, you can go with either, given that the
> > > > resources are collocated.
> > > >
> > > > > Thanks,
> > > > > Doug
> > > > >
> > > > > On Thu, 2007-04-26 at 09:45 -0400, Doug Knight wrote:
> > > > > > Hi Alastair,
> > > > > > Have you encountered a situation where, when you first start
> > > > > > up the drbd master/slave resource, crm_mon and/or the GUI
> > > > > > indicate Master status on one node and Started status on the
> > > > > > other (as opposed to Slave)? If so, how did you correct it?
> > > > > >
> > > > > > Doug
> > > > > > p.s. Thanks for the scripts and xml, they're a big help!

--
Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
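[For comparison with the CIB/Idioms page mentioned above, the
constraint set under discussion -- a location preference on the
master/slave resource, the group colocated with the DRBD master, and
ordering so the group starts only after promotion -- might look
roughly like this. This is a sketch from memory of the Heartbeat 2.x
CIB DTD, not taken from the attached config; every id, resource name
(ms_drbd, grp_pgsql, node1), and attribute is a placeholder that
should be checked against the Idioms page:]

```xml
<constraints>
  <!-- prefer the DRBD master/slave resource on node1 -->
  <rsc_location id="loc_ms_drbd" rsc="ms_drbd">
    <rule id="loc_ms_drbd_rule" score="100">
      <expression id="loc_ms_drbd_expr" attribute="#uname"
                  operation="eq" value="node1"/>
    </rule>
  </rsc_location>
  <!-- run the filesystem/pgsql/IP group only where DRBD is master -->
  <rsc_colocation id="col_grp_with_master" from="grp_pgsql"
                  to="ms_drbd" to_role="master" score="INFINITY"/>
  <!-- start the group only after DRBD has been promoted -->
  <rsc_order id="ord_grp_after_promote" from="grp_pgsql" action="start"
             type="after" to="ms_drbd" to_action="promote"/>
</constraints>
```

[Once the group is colocated with the master, it should matter little
whether the location preference sits on the master/slave resource or
on the group, which matches the comment earlier in the thread.]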
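[The "pull power plug" simulation suggested in the thread (kill -9, so
heartbeat cannot negotiate a clean shutdown with its peer) can be
wrapped in a small script. A sketch only: the pgrep flags assume a
procps-style userland, and "heartbeat" as the master process name is
taken from the thread:]

```shell
#!/bin/sh
# Simulate a hard node failure: SIGKILL the master heartbeat process
# (-o selects the oldest matching process, i.e. the parent) so it has
# no chance to announce a clean shutdown.
PID=$(pgrep -x -o heartbeat || true)
if [ -n "$PID" ]; then
    kill -9 "$PID"
    echo "killed heartbeat (pid $PID); watch crm_mon on the peer node"
else
    echo "heartbeat not running"
fi
```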
