Hi all,
I'm starting a new thread because I now have more debug details about my
situation, and starting from the beginning seemed better.
My environment consists of two machines, each connected to the network
and directly to each other. The cluster runs a lot of virtual machines,
each one backed by a dual-primary DRBD resource. The two systems run
Debian Squeeze with backports:
kernel 2.6.39-3
drbd 8.3.10-1
corosync 1.3.0-3
pacemaker 1.0.11-1
libvirt-bin 0.9.2-7
The (dual-primary) drbd resources are declared in this way:
primitive vm-1_r0 ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="20s" role="Master" timeout="20s" \
op monitor interval="30s" role="Slave" timeout="20s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
ms vm-1_ms-r0 vm-1_r0 \
meta notify="true" master-max="2" clone-max="2" interleave="true"
and the virtual machines like this:
primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" \
migration_transport="ssh" force_stop="true" \
meta allow-migrate="true" \
op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s"
There are colocation and order constraints for each VM:
colocation vm-1_ON_vm-1_ms-r0 inf: vm-1 vm-1_ms-r0:Master
order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1:start
And there is a location constraint for the connectivity:
location vm-1_ON_CONNECTED_NODE vm-1 \
rule $id="vm-1_ON_CONNECTED_NODE-rule" -inf: not_defined ping or ping lte 0
The problem is that I have scheduled a live migration of a VM every
night, and whenever it fails the node gets fenced, even though the
on-fail parameter of the VM's monitor operation is set to "restart".
Everything starts at 23:00:
Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource
-M -r vm-1
Two seconds later the first problem:
Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation
monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148,
its parameters: hypervisor=[qemu:///system] CRM_meta_depth=[0]
CRM_meta_timeout=[30000] force_stop=[true]
config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0]
crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart]
CRM_meta_name=[monitor] migration_transport=[ssh]
CRM_meta_interval=[10000] cancelled
Why is this operation marked as cancelled? Anyway, 20 seconds later, the
migrate_to operation fails with "Timed Out":
Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM
operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=20000ms)
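The 20000 ms timeout in this message is not one of the values declared
on the primitive above, which defines only monitor, start and stop
operations; presumably it is the default applied to migrate_to. A
minimal sketch of declaring the migration operations explicitly,
assuming the VirtualDomain agent's migrate_to and migrate_from actions
and a purely illustrative 120s value (untested here):
# Assumption: the 120s migrate timeouts are illustrative, not tested
primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" \
migration_transport="ssh" force_stop="true" \
meta allow-migrate="true" \
op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op migrate_to interval="0" timeout="120s" \
op migrate_from interval="0" timeout="120s"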
Force shutdown is invoked:
Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced
shutdown (destroy) request for domain vm-1.
and even though the VM appears to be destroyed (the kernel messages
confirm that the vmnet devices were destroyed), the RA seems to ignore it:
Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output:
(vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output:
(vm-1_virtualdomain:stop:stderr) error: Requested operation is not
valid: domain is not running
Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM
operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445,
confirmed=true) unknown error
In the meantime, on the other node, the errors are detected:
Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload:
Migrating vm-1_virtualdomain from node-2 to node-1
Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload:
Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (1000000)
...
...
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing
failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing
failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2
will be fenced to recover from resource failure(s)
A STONITH is invoked...
Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid:
2314] requests a STONITH operation RESET on node node-2
...with success:
Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the
node node-2: optype=RESET. whodoit: node-1
My conclusions are:
1 - the fence has nothing to do with DRBD (there is no mention of it
until the reset is done);
2 - for some reason live migrating the VMs SOMETIMES fails, even though
once the system has recovered I can run crm resource move vm-1 without
any problem;
3 - even if the VM fails to stop, the cluster does not try to restart
it, but simply fences the node, and this is not what the on-fail
parameter is meant to do.
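Regarding point 3: in the configuration above, on-fail="restart" is
attached only to the monitor operation, while the operation that failed
just before the fence decision is stop. As far as I understand
Pacemaker, on-fail is evaluated per operation, and a failed stop
defaults to fencing when STONITH is enabled. A sketch of setting it on
stop as well, assuming that "block" (leave the resource alone until it
is cleaned up manually) is an acceptable policy, which I have not tested:
# Assumption: on-fail="block" on stop is illustrative and untested
primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" \
migration_transport="ssh" force_stop="true" \
meta allow-migrate="true" \
op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" on-fail="block"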
Does anyone have suggestions on how to debug this problem further?
Please help!
Thanks a lot,
--
RaSca
Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
http://www.miamammausalinux.org