On Tue, Sep 20, 2011 at 5:59 PM, RaSca <[email protected]> wrote:
> Hi all,
> I start a new thread because I've got more debug details to analyze my
> situation, and starting from the beginning might be better.
>
> My environment is composed by two machine connected to a network and one
> to each other. The cluster runs a lot of virtual machines, each one
> based upon a dual primary drbd. The two systems are Debian Squeeze with
> backports:
>
> kernel 2.6.39-3
> drbd 8.3.10-1
> corosync 1.3.0-3
> pacemaker 1.0.11-1
> libvirt-bin 0.9.2-7
>
> The (dual-primary) drbd resources are declared in this way:
>
> primitive vm-1_r0 ocf:linbit:drbd \
>   params drbd_resource="r0" \
>   op monitor interval="20s" role="Master" timeout="20s" \
>   op monitor interval="30s" role="Slave" timeout="20s" \
>   op start interval="0" timeout="240s" \
>   op stop interval="0" timeout="100s"
>
> ms vm-1_ms-r0 vm-1_r0 \
>   meta notify="true" master-max="2" clone-max="2" interleave="true"
>
> and the virtual machine are like this:
>
> primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
>   params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system"
>   migration_transport="ssh" force_stop="true" \
>   meta allow-migrate="true" \
>   op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
>   op start interval="0" timeout="120s" \
>   op stop interval="0" timeout="120s"
>
> There are colocation and order for each vm:
>
> colocation vm-1_ON_vm-1_ms-r0 inf: vm-1 vm-1_ms-r0:Master
> order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1:start
>
> And there is a location constraint for the connectivity:
>
> location vm-1_ON_CONNECTED_NODE vm-1 \
>   rule $id="vm-1_ON_CONNECTED_NODE-rule" -inf: not_defined ping or ping
>   lte 0
>
> The problem is that every night I've scheduled a live migration of a vm,
> but if this fails, then the node gets fenced, even if the on-fail
> parameter of the vm is set to "restart".
> Everything starts at 23:
>
> Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource
> -M -r vm-1
>
> Two seconds later the first problem:
>
> Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation
> monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148,
> its parameters: hypervisor=[qemu:///system]
> CRM_meta_depth=[0] CRM_meta_timeout=[30000] force_stop=[true]
> config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0]
> crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart]
> CRM_meta_name=[monitor] migration_transport=[ssh]
> CRM_meta_interval=[10000] cancelled
>
> why this operation is marked ad cancelled?

Hard to tell from just one log message. My guess though, since it's a
recurring operation, is that we're about to run stop or migrate_from for
the resource - before which we cancel all recurring monitor ops.

> Anyway, after 22 seconds, the
> operation fails with "Timed Out":
>
> Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM
> operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=20000ms)

No, this is a completely independent operation to the one being
cancelled. Is 20s enough time to migrate the VM to another machine?

> Force shutdown is invoked:
>
> Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced
> shutdown (destroy) request for domain vm-1.
>
> and even if the vm appears to be destroyed (the kernel messages confirm
> the the vmnet devices were destroyed), the RA seems to ignore it:
>
> Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output:
> (vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
> Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output:
> (vm-1_virtualdomain:stop:stderr) error: Requested operation is not
> valid: domain is not running
> Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM
> operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445,
> confirmed=true) unknown error

The RA isn't ignoring it, it's reporting that state as an error instead
of OCF_NOT_RUNNING, which is probably more appropriate.

>
> In the meantime on the other node, since some errors are discovered:
>
> Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload:
> Migrating vm-1_virtualdomain from node-2 to node-1
> Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload:
> Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (1000000)
> ...
> ...
> Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing
> failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
> Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing
> failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
> Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2
> will be fenced to recover from resource failure(s)

Right, so stop failed too... hence the fencing.

>
> a STONITH is invoked...
>
> Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid:
> 2314] requests a STONITH operation RESET on node node-2
>
> ...with success:
>
> Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the
> node node-2: optype=RESET. whodoit: node-1
>
> My conclusions are:
>
> 1 - the fence has nothing to do with drbd (there is no mention to it
> until the reset is done);
>
> 2 - for some reason live migrating the vms SOMETIMES fails, even if once
> the system has recovered I can do a crm resource move vm-1 with ANY problem.
>
> 3 - Even if the vm fails to stop the cluster does not try to restart it,
> but simply fence the node, and this is not what the on-fail parameter is
> meant to do.

/stop/ failed; your on-fail setting only applies to the /monitor/
operation.

>
> Does someone have some suggestions on how to debug more this problem?
> Please help!
>
> Thanks a lot,
>
> --
> RaSca
> Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
> [email protected]
> http://www.miamammausalinux.org
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
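
Following up on the 20s question: the primitive as posted declares no
migrate_to or migrate_from op, and the log shows migrate_to running with
timeout=20000ms. If the migration genuinely needs longer, one option is
to declare those operations explicitly, the same way start and stop
already are. The sketch below is only an illustration: the 120s values
are an assumption copied from the existing start/stop timeouts, not
measured figures, and the right number depends on how long a live
migration of this VM really takes.

```
# Sketch only: the migrate_to/migrate_from timeouts are assumed values,
# everything else is the primitive as posted.
primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" \
                migration_transport="ssh" force_stop="true" \
        meta allow-migrate="true" \
        op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op migrate_to interval="0" timeout="120s" \
        op migrate_from interval="0" timeout="120s"
```

Note that on-fail="restart" still sits on the monitor line only, so it
says nothing about what happens when stop or migrate_to fails; each op
line carries its own settings.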
