Hi,

On Tue, Dec 28, 2010 at 03:18:06PM -0700, Greg Woods wrote:
> I updated one of my clusters today, and among other things, I updated
> from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related
> or not.
> 
> The problem is that I cannot get the cluster to come up clean. Right now
> all resources are running on one node and it is OK that way. As soon as
> I start heartbeat on the second node, it goes into a STONITH
> deathmatch. What I see are some failed actions while trying to stop a
> DRBD resource group. Here is a log snippet:
> 
> Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
> Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
> op=vmgroup1:0_stop_0 )
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
> rsc:vmgroup1:0:30: stop
> Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
> Performing key=50:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
> op=vmgroup2:0_stop_0 )
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
> rsc:vmgroup2:0:31: stop
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
> vmgroup1:0:stop process 8088 exited with return code 6.
> Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info:
> process_lrm_event: LRM operation vmgroup1:0_stop_0 (call=30, rc=6,
> cib-update=36, confirmed=true) not configured
> Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
> vmgroup2:0:stop process 8089 exited with return code 6.

No messages from the drbd RA? It should be quite loud in this
case. This smells like a bug found in 1.0.9 which should've been
fixed a while ago:

http://developerbugs.linux-foundation.org/show_bug.cgi?id=2458
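For reference, the rc=6 in the lrmd warnings is OCF_ERR_CONFIGURED ("not
configured"), which is why the crmd reports the stop that way. A minimal
sketch for decoding such codes, following the usual OCF resource agent
exit-code convention (the ocf_rc_name helper is just illustrative, not
part of any package):

```shell
#!/bin/sh
# Map an OCF resource agent exit code to its symbolic name.
# (Illustrative helper; the code/name pairs follow the OCF convention.)
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo OCF_UNKNOWN ;;
    esac
}

ocf_rc_name 6   # the return code from the vmgroup stop operations above
```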

> In this example, "vmgroup1" and "vmgroup2" are DRBD resources, then set
> up as master/slave clones, which is the standard way to do this. It
> looks like this in the crm shell:
> 
> primitive vmgroup1 ocf:linbit:drbd \
>         params drbd_resource="vmgroup1" \
>         op monitor interval="59s" role="Master" timeout="30s" \
>         op monitor interval="60s" role="Slave" timeout="20s" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="100s"
> [...]
> ms ms-vmgroup1 vmgroup1 \
>         meta clone-max="2" notify="true" globally-unique="false" \
>         target-role="Started"
> 
> This has always worked fine until today. 
> 
> Any ideas what I can do to further debug this?

If it's not a resource problem (i.e. drbd), please either reopen
the bugzilla above or open a new one if it looks like a different
problem. Don't forget to attach hb_report.

Thanks,

Dejan

> I am running on CentOS 5.5 using the clusterlabs repos.
> 
> --Greg
> 
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
