Hello,
I have a problem with timeouts from my RA for virtual Xen machines.
Since xm commands are very slow, I have replaced them with xentop and
pgrep commands to avoid timeouts from xm. Unfortunately, this didn't
solve the problem: corosync/pacemaker still times out when requesting the
status of a resource, which ends in "Remote node did not respond" messages.
Any idea how to improve the performance of communication between the nodes?
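For reference, here is a simplified sketch of the pgrep-based status check
that replaced the xm calls (the qemu-dmbin process name and the OCF return
codes match the trace further below; the exact pattern is specific to my setup):

```shell
# Sketch of a pgrep-based Xen status check. Assumes each running DomU
# has a device-model helper process matching "qemu-dmbin.*domain-name <name>",
# as seen in the bash -x trace below.
xen_status() {
    domain="$1"
    if pgrep -f "qemu-dmbin.*domain-name ${domain}" >/dev/null 2>&1; then
        return 0   # OCF_SUCCESS: device-model process found
    else
        return 7   # OCF_NOT_RUNNING: no matching process
    fi
}
```

This avoids talking to xend entirely, which is why the monitor action itself
finishes in well under a second.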
This is an example config of a resource:
primitive VM_24_435 ocf:hostway:Xen \
params xmfile="/etc/xen/conf.d/VM_24_435.cfg" \
op start interval="0" timeout="60s" on-fail="restart" \
op stop interval="0" timeout="40s" \
meta target-role="Started" allow-migrate="true"
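(For completeness: the primitive above has no explicit monitor op, so the
failures below come from the initial probe. An explicit monitor op would look
roughly like this; the interval and timeout values are only placeholders:)

```
op monitor interval="30s" timeout="60s" \
```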
[...]
...and the basic configuration:
no-quorum-policy="ignore" \
stonith-enabled="false" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="openais" \
expected-quorum-votes="9" \
last-lrm-refresh="1327655374" \
symmetric-cluster="true" \
default-resource-stickiness="INFINITY" \
maintenance-mode="false" \
corosync/pacemaker is installed from Debian backports:
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
9 Nodes configured, 9 expected votes
When I start a cleanup (crm resource cleanup VM_24_435) while the cib
process is already busy, this is what crm_mon tells me:
Failed actions:
VM_24_435_monitor_0 (node=vnode04, call=-1, rc=1, status=Timed
Out): unknown error
VM_24_435_monitor_0 (node=vnode06, call=-1, rc=1, status=Timed
Out): unknown error
The monitor request itself is quick:
:~$ time bash -x /usr/lib/ocf/resource.d/hostway/Xen monitor
+ . /usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
++ OCF_SUCCESS=0
++ OCF_ERR_GENERIC=1
++ OCF_ERR_ARGS=2
++ OCF_ERR_UNIMPLEMENTED=3
++ OCF_ERR_PERM=4
++ OCF_ERR_INSTALLED=5
++ OCF_ERR_CONFIGURED=6
++ OCF_NOT_RUNNING=7
++ OCF_RUNNING_MASTER=8
++ OCF_FAILED_MASTER=9
+ . /root/hep/scripts/funcs
+++ basename /usr/lib/ocf/resource.d/hostway/Xen
++ prog=Xen
++ cib=/tmp/cib
+ : /etc/xen/vm/MyDomU
+ DOMAIN_NAME=VM_24_435
+ '[' 1 -ne 1 ']'
+ case $1 in
+ '[' x '!=' x ']'
+ '[' -f /etc/xen/vm/MyDomU ']'
+ DOMAIN_NAME=VM_24_435
+ case $1 in
+ Xen_Monitor
+ Xen_Status VM_24_435
+ /usr/bin/pgrep -lf 'qemu-dmbin.*domain-name VM_24_435'
+ '[' 0 = 0 ']'
+ return 0
+ [[ 0 -eq 7 ]]
+ '[' X = X ']'
+ return 0
+ exit 0
real 0m0.122s
My corosync config was the Debian default; I increased max_messages
from 20 to 500 and tried transport udpu as well as secauth with 8
threads on an 8-core machine, but the cib process still climbs to 100%
CPU usage on a 1 Gbit network.
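For reference, the relevant part of my totem section now looks roughly like
this (a sketch; everything not shown is still the Debian default, and the
values are the ones mentioned above):

```
totem {
        version: 2
        transport: udpu       # switched from the default multicast
        max_messages: 500     # raised from the default of 20
        secauth: on
        threads: 8            # one per core on the 8-core hosts
        # interface section unchanged from the Debian default
}
```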
Here's the logfile:
Jan 27 09:49:16 vnode10 crmd: [27990]: ERROR: send_msg_via_ipc: Unknown
Sub-system (9593_crm_resource)... discarding message.
Jan 27 09:49:18 vnode10 crmd: [27990]: ERROR: send_msg_via_ipc: Unknown
Sub-system (9593_crm_resource)... discarding message.
Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
Shutdown escalation occurs after: 1200000ms
Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
Checking for expired actions every 900000ms
Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
Sending expected-votes=9 to corosync
Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
Shutdown escalation occurs after: 1200000ms
Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
Checking for expired actions every 900000ms
Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
Sending expected-votes=9 to corosync
Jan 27 09:49:18 vnode10 crmd: [27990]: info: ais_dispatch_message:
Membership 6204: quorum retained
Jan 27 09:49:18 vnode10 crmd: [27990]: info: ais_dispatch_message:
Membership 6204: quorum retained
Jan 27 09:49:56 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing
key=17:117:7:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_monitor_0 )
Jan 27 09:49:56 vnode10 lrmd: [27987]: info: rsc:VM_24_435:48: probe
Jan 27 09:49:57 vnode10 crmd: [27990]: info: process_lrm_event: LRM
operation VM_24_435_monitor_0 (call=48, rc=0, cib-update=260,
confirmed=true) ok
Jan 27 09:49:58 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing
key=75:118:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_stop_0 )
Jan 27 09:49:58 vnode10 lrmd: [27987]: info: rsc:VM_24_435:49: stop
Jan 27 09:50:05 vnode10 crmd: [27990]: info: process_lrm_event: LRM
operation VM_24_435_stop_0 (call=49, rc=0, cib-update=261,
confirmed=true) ok
Jan 27 09:50:05 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing
key=76:118:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_start_0 )
Jan 27 09:50:05 vnode10 lrmd: [27987]: info: rsc:VM_24_435:50: start
Jan 27 09:50:13 vnode10 lrmd: [27987]: WARN: on_msg_get_metadata: empty
metadata for ocf::hostway::Xen.
Jan 27 09:50:13 vnode10 lrmd: [27987]: WARN: G_SIG_dispatch: Dispatch
function for SIGCHLD was delayed 200 ms (> 100 ms) before being called
(GSource: 0x2412430)
Jan 27 09:50:13 vnode10 lrmd: [27987]: info: G_SIG_dispatch: started at
1838690515 should have started at 1838690495
Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR:
lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
message of rmetadata with function get_ret_from_msg.
Jan 27 09:50:13 vnode10 crmd: [27990]: WARN: get_rsc_metadata: No
metadata found for Xen::ocf:hostway
Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: string2xml: Can't parse
NULL input
Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: get_rsc_restart_list:
Metadata for hostway::ocf:Xen is not valid XML
Jan 27 09:50:13 vnode10 crmd: [27990]: info: process_lrm_event: LRM
operation VM_24_435_start_0 (call=50, rc=0, cib-update=262,
confirmed=true) ok
Eventually, the cib becomes unresponsive:
:~$ crm resource manage VM_24_435
Call cib_replace failed (-41): Remote node did not respond
<null>
Jan 27 10:22:14 nfscc01 cib: [1318]: info: cib_process_diff: Diff
0.12146.136 -> 0.12146.137 not applied to 0.12145.4: current "epoch" is
less than required
Jan 27 10:22:14 nfscc01 cib: [1318]: info: cib_server_process_diff:
Requesting re-sync from peer
Jan 27 10:22:14 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.137 -> 0.12146.138 (sync in progress)
Jan 27 10:22:14 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.138 -> 0.12146.139 (sync in progress)
Jan 27 10:23:10 nfscc01 crmd: [1322]: WARN: cib_rsc_callback: Resource
update 314 failed: (rc=-41) Remote node did not respond
Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.139 -> 0.12146.140 (sync in progress)
Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.140 -> 0.12146.141 (sync in progress)
Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.141 -> 0.12146.142 (sync in progress)
Jan 27 10:23:30 nfscc01 cib: [1318]: info: cib_process_diff: Diff
0.12146.142 -> 0.12146.143 not applied to 0.12145.4: current "epoch" is
less than required
Jan 27 10:23:30 nfscc01 cib: [1318]: info: cib_server_process_diff:
Requesting re-sync from peer
Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.143 -> 0.12146.144 (sync in progress)
Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.144 -> 0.12146.145 (sync in progress)
Jan 27 10:23:31 nfscc01 crmd: [1322]: info: do_lrm_rsc_op: Performing
key=3:161:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_stop_0 )
Jan 27 10:23:31 nfscc01 lrmd: [1319]: info: rsc:VM_24_435:65: stop
Jan 27 10:23:31 nfscc01 crmd: [1322]: info: process_lrm_event: LRM
operation VM_24_435_stop_0 (call=65, rc=0, cib-update=315,
confirmed=true) ok
Jan 27 10:23:41 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1594 for last-failure-VM_24_435=1327655960 failed: Remote node did not
respond
Jan 27 10:23:41 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1596 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1603 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1611 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1616 for last-failure-VM_24_435=1327655960 failed: Remote node did not
respond
Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1621 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1626 for last-failure-VM_24_435=1327655960 failed: Remote node did not
respond
Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1631 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1633 for last-failure-VM_24_435=1327655960 failed: Remote node did not
respond
Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1635 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1642 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1649 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1656 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1661 for last-failure-VM_24_435=1327655960 failed: Remote node did not
respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1666 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1671 for last-failure-VM_24_435=1327655960 failed: Remote node did not
respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1676 for probe_complete=true failed: Remote node did not respond
Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1684 for probe_complete=true failed: Remote node did not respond
Jan 27 10:24:04 nfscc01 attrd: [1320]: notice: attrd_ais_dispatch:
Update relayed from vnode06
Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.145 -> 0.12146.146 (sync in progress)
Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.145 -> 0.12146.146 (sync in progress)
Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff:
Not applying diff 0.12146.146 -> 0.12147.1 (sync in progress)
Jan 27 10:24:05 nfscc01 cib: [1318]: info: cib_process_diff: Diff
0.12147.1 -> 0.12147.2 not applied to 0.12145.4: current "epoch" is less
than required
Jan 27 10:24:05 nfscc01 cib: [1318]: info: cib_server_process_diff:
Requesting re-sync from peer
Jan 27 10:24:08 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
1691 for probe_complete=true failed: Remote node did not respond
Jan 27 10:24:31 nfscc01 crmd: [1322]: WARN: cib_rsc_callback: Resource
update 315 failed: (rc=-41) Remote node did not respond
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems