Hi,

On Tue, Dec 28, 2010 at 03:18:06PM -0700, Greg Woods wrote:
> I updated one of my clusters today, and among other things, I updated
> from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly
> related or not.
>
> The problem is that I cannot get the cluster to come up clean. Right
> now all resources are running on one node and it is OK that way. As
> soon as I start heartbeat on the second node, it goes into a stonith
> death match. What I see is some failed actions involving trying to
> stop a DRBD resource group. Here is a log snippet:
>
> Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
> Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
> op=vmgroup1:0_stop_0 )
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
> rsc:vmgroup1:0:30: stop
> Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
> Performing key=50:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
> op=vmgroup2:0_stop_0 )
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
> rsc:vmgroup2:0:31: stop
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
> vmgroup1:0:stop process 8088 exited with return code 6.
> Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info:
> process_lrm_event: LRM operation vmgroup1:0_stop_0 (call=30, rc=6,
> cib-update=36, confirmed=true) not configured
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
> vmgroup2:0:stop process 8089 exited with return code 6.
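[Editor's note: return code 6 in the lrmd log above is OCF_ERR_CONFIGURED in the standard OCF resource agent exit-code convention, which is why crmd reports the operation as "not configured". A minimal sketch decoding the codes; the `ocf_rc_name` helper is invented here for illustration and is not part of any cluster tool:]

```shell
#!/bin/sh
# Map the standard OCF resource agent exit codes to their names.
# ocf_rc_name is a hypothetical helper, not part of pacemaker/heartbeat.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        *) echo "unknown rc $1" ;;
    esac
}

# rc=6, as seen in the log above:
ocf_rc_name 6
```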
No messages from the drbd RA? It should be quite loud in this case.
This smells like a bug found in 1.0.9 which should've been fixed a
while ago:

http://developerbugs.linux-foundation.org/show_bug.cgi?id=2458

> In this example, "vmgroup1" and "vmgroup2" are DRBD resources, then
> set up as clones, which is the standard way to do this. It looks like
> this in the crm shell:
>
> primitive vmgroup1 ocf:linbit:drbd \
>     params drbd_resource="vmgroup1" \
>     op monitor interval="59s" role="Master" timeout="30s" \
>     op monitor interval="60s" role="Slave" timeout="20s" \
>     op start interval="0" timeout="240s" \
>     op stop interval="0" timeout="100s"
> [...]
> ms ms-vmgroup1 vmgroup1 \
>     meta clone-max="2" notify="true" globally-unique="false" \
>     target-role="Started"
>
> This has always worked fine until today.
>
> Any ideas what I can do to further debug this?

If it's not a resource problem (i.e. drbd), please either reopen the
bugzilla above or open a new one if it looks like a different problem.
Don't forget to attach hb_report.

Thanks,

Dejan

> I am running on CentOS 5.5 using the clusterlabs repos.
>
> --Greg

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
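[Editor's note: one way to see what the agent itself says, in line with Dejan's suspicion that this is a resource-level problem, is to run the RA by hand on the failing node with the OCF environment set. A sketch, assuming the stock linbit agent path (`/usr/lib/ocf/resource.d/linbit/drbd`) and the resource name from the thread; the snippet only builds and prints the command, so it is safe to inspect before actually running it as root:]

```shell
#!/bin/sh
# Build (and print, rather than execute) a manual invocation of the
# ocf:linbit:drbd agent's stop action. The RA path and the
# OCF_RESKEY_drbd_resource value are assumptions; adjust for your
# installation before running this for real.
OCF_ROOT=/usr/lib/ocf
RA="$OCF_ROOT/resource.d/linbit/drbd"
CMD="OCF_ROOT=$OCF_ROOT OCF_RESKEY_drbd_resource=vmgroup1 $RA stop"
echo "$CMD"
# Running the printed command on the failing node (followed by
# "echo rc=$?") shows the agent's own error output and exit code
# directly, without lrmd in between.
```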
