Hi,

On Fri, Jan 27, 2012 at 10:49:52AM +0100, Sascha Reimann wrote:
> hello,
>
> I have a problem with timeouts from my RA for virtual Xen machines.
> Since xm commands are very slow, I have replaced them by some xentop and
> pgrep commands to avoid timeouts from xm. Unfortunately, this didn't
> solve the problem, as corosync/pacemaker gets a timeout requesting the
> status of a resource and ends up in "Remote node did not respond" messages.
That has nothing to do with the RA performance.

> Any idea how to increase the performance of communication between the
> nodes?

Reduce the configuration, if possible. Split one big cluster into two
smaller ones, if possible. How many resources do you have? How big does
your CIB get (cibadmin -Q | wc)?

> This is an example config of a resource:
>
> primitive VM_24_435 ocf:hostway:Xen \
>     params xmfile="/etc/xen/conf.d/VM_24_435.cfg" \
>     op start interval="0" timeout="60s" on-fail="restart" \
>     op stop interval="0" timeout="40s" \
>     meta target-role="Started" allow-migrate="true"
> [...]
>
> ...and the basic configuration:
>
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="9" \
> last-lrm-refresh="1327655374" \
> symmetric-cluster="true" \
> default-resource-stickiness="INFINITY" \
> maintenance-mode="false" \
>
> corosync/pacemaker is installed from debian backports:
>
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 9 Nodes configured, 9 expected votes
>
> When I start a cleanup (crm resource cleanup VM_24_435) while the cib
> process is already busy, this is what crm_mon tells me:
> Failed actions:
>     VM_24_435_monitor_0 (node=vnode04, call=-1, rc=1, status=Timed Out): unknown error
>     VM_24_435_monitor_0 (node=vnode06, call=-1, rc=1, status=Timed Out): unknown error

This looks like a real monitor timeout, but it is not easy to tell
without looking at the configuration and logs. Are the timeouts
sufficiently big? Did you run crm configure verify?

> The monitor request itself is quick:
>
> :~$ time bash -x /usr/lib/ocf/resource.d/hostway/Xen monitor
> + . /usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
> ++ OCF_SUCCESS=0
> ++ OCF_ERR_GENERIC=1
> ++ OCF_ERR_ARGS=2
> ++ OCF_ERR_UNIMPLEMENTED=3
> ++ OCF_ERR_PERM=4
> ++ OCF_ERR_INSTALLED=5
> ++ OCF_ERR_CONFIGURED=6
> ++ OCF_NOT_RUNNING=7
> ++ OCF_RUNNING_MASTER=8
> ++ OCF_FAILED_MASTER=9
> + . /root/hep/scripts/funcs
> +++ basename /usr/lib/ocf/resource.d/hostway/Xen
> ++ prog=Xen
> ++ cib=/tmp/cib
> + : /etc/xen/vm/MyDomU
> + DOMAIN_NAME=VM_24_435
> + '[' 1 -ne 1 ']'
> + case $1 in
> + '[' x '!=' x ']'
> + '[' -f /etc/xen/vm/MyDomU ']'
> + DOMAIN_NAME=VM_24_435
> + case $1 in
> + Xen_Monitor
> + Xen_Status VM_24_435
> + /usr/bin/pgrep -lf 'qemu-dmbin.*domain-name VM_24_435'
> + '[' 0 = 0 ']'
> + return 0
> + [[ 0 -eq 7 ]]
> + '[' X = X ']'
> + return 0
> + exit 0
>
> real    0m0.122s
>
> My corosync-config was debian default, I modified max_messages
> (increased from 20 to 500), tried transport udpu and secauth with 8
> threads on an 8-core machine, but still the cib process gets up to 100%
> cpu usage in a 1Gig network.

The cib process spends most of its time processing the CIB itself, so
the kind of transport won't have much impact. Such a high load probably
comes from too frequent CIB updates.

> Here's the logfile:
>
> Jan 27 09:49:16 vnode10 crmd: [27990]: ERROR: send_msg_via_ipc: Unknown Sub-system (9593_crm_resource)... discarding message.
> Jan 27 09:49:18 vnode10 crmd: [27990]: ERROR: send_msg_via_ipc: Unknown Sub-system (9593_crm_resource)... discarding message.

This is probably a client that exited before receiving its answer.
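To get a feeling for how big the CIB is and how often it is updated, something along these lines can be run on a cluster node. This is a rough sketch assuming the standard pacemaker CLI tools; the 60s sampling window is arbitrary, and num_updates resets to 0 whenever the epoch changes, so ignore samples that span a configuration change:

```shell
# Total CIB size, and the size of the status section, which is
# usually where most of the churn happens:
cibadmin -Q | wc -c
cibadmin -Q -o status | wc -c

# Rough update rate: num_updates on the <cib> element is bumped on
# every change, so sample it twice and compare:
cibadmin -Q | grep -m1 -o 'num_updates="[0-9]*"'
sleep 60
cibadmin -Q | grep -m1 -o 'num_updates="[0-9]*"'
```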
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback: Checking for expired actions every 900000ms
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback: Sending expected-votes=9 to corosync
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback: Checking for expired actions every 900000ms
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback: Sending expected-votes=9 to corosync
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: ais_dispatch_message: Membership 6204: quorum retained
> Jan 27 09:49:18 vnode10 crmd: [27990]: info: ais_dispatch_message: Membership 6204: quorum retained
> Jan 27 09:49:56 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing key=17:117:7:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_monitor_0 )
> Jan 27 09:49:56 vnode10 lrmd: [27987]: info: rsc:VM_24_435:48: probe
> Jan 27 09:49:57 vnode10 crmd: [27990]: info: process_lrm_event: LRM operation VM_24_435_monitor_0 (call=48, rc=0, cib-update=260, confirmed=true) ok
> Jan 27 09:49:58 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing key=75:118:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_stop_0 )
> Jan 27 09:49:58 vnode10 lrmd: [27987]: info: rsc:VM_24_435:49: stop
> Jan 27 09:50:05 vnode10 crmd: [27990]: info: process_lrm_event: LRM operation VM_24_435_stop_0 (call=49, rc=0, cib-update=261, confirmed=true) ok
> Jan 27 09:50:05 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing key=76:118:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_start_0 )
> Jan 27 09:50:05 vnode10 lrmd: [27987]: info: rsc:VM_24_435:50: start
> Jan 27 09:50:13 vnode10 lrmd: [27987]: WARN: on_msg_get_metadata: empty metadata for ocf::hostway::Xen.
> Jan 27 09:50:13 vnode10 lrmd: [27987]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 200 ms (> 100 ms) before being called (GSource: 0x2412430)

Is the host too busy? Did you check uptime and performance statistics?

> Jan 27 09:50:13 vnode10 lrmd: [27987]: info: G_SIG_dispatch: started at 1838690515 should have started at 1838690495
> Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg.
> Jan 27 09:50:13 vnode10 crmd: [27990]: WARN: get_rsc_metadata: No metadata found for Xen::ocf:hostway
> Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: string2xml: Can't parse NULL input
> Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: get_rsc_restart_list: Metadata for hostway::ocf:Xen is not valid XML

Your OCF RA is not OCF compliant: judging by the messages above, it does
not provide valid meta-data. Fix it and use ocf-tester to make sure that
all required functionality is correct.

> Jan 27 09:50:13 vnode10 crmd: [27990]: info: process_lrm_event: LRM operation VM_24_435_start_0 (call=50, rc=0, cib-update=262, confirmed=true) ok
>
> Eventually, the cib gets unresponsive:
>
> :~$ crm resource manage VM_24_435
> Call cib_replace failed (-41): Remote node did not respond
> <null>
>
> Jan 27 10:22:14 nfscc01 cib: [1318]: info: cib_process_diff: Diff 0.12146.136 -> 0.12146.137 not applied to 0.12145.4: current "epoch" is less than required
> Jan 27 10:22:14 nfscc01 cib: [1318]: info: cib_server_process_diff: Requesting re-sync from peer
> Jan 27 10:22:14 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.137 -> 0.12146.138 (sync in progress)
> Jan 27 10:22:14 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.138 -> 0.12146.139 (sync in progress)
> Jan 27 10:23:10 nfscc01 crmd: [1322]: WARN: cib_rsc_callback: Resource update 314 failed: (rc=-41) Remote node did not respond
> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.139 -> 0.12146.140 (sync in progress)
> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.140 -> 0.12146.141 (sync in progress)
> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.141 -> 0.12146.142 (sync in progress)
> Jan 27 10:23:30 nfscc01 cib: [1318]: info: cib_process_diff: Diff 0.12146.142 -> 0.12146.143 not applied to 0.12145.4: current "epoch" is less than required
> Jan 27 10:23:30 nfscc01 cib: [1318]: info: cib_server_process_diff: Requesting re-sync from peer
> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.143 -> 0.12146.144 (sync in progress)
> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.144 -> 0.12146.145 (sync in progress)
> Jan 27 10:23:31 nfscc01 crmd: [1322]: info: do_lrm_rsc_op: Performing key=3:161:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_stop_0 )
> Jan 27 10:23:31 nfscc01 lrmd: [1319]: info: rsc:VM_24_435:65: stop
> Jan 27 10:23:31 nfscc01 crmd: [1322]: info: process_lrm_event: LRM operation VM_24_435_stop_0 (call=65, rc=0, cib-update=315, confirmed=true) ok
> Jan 27 10:23:41 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1594 for last-failure-VM_24_435=1327655960 failed: Remote node did not respond
> Jan 27 10:23:41 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1596 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1603 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1611 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1616 for last-failure-VM_24_435=1327655960 failed: Remote node did not respond
> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1621 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1626 for last-failure-VM_24_435=1327655960 failed: Remote node did not respond
> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1631 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1633 for last-failure-VM_24_435=1327655960 failed: Remote node did not respond
> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1635 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1642 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1649 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1656 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1661 for last-failure-VM_24_435=1327655960 failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1666 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1671 for last-failure-VM_24_435=1327655960 failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1676 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1684 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:24:04 nfscc01 attrd: [1320]: notice: attrd_ais_dispatch: Update relayed from vnode06
> Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.145 -> 0.12146.146 (sync in progress)
> Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.145 -> 0.12146.146 (sync in progress)
> Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff: Not applying diff 0.12146.146 -> 0.12147.1 (sync in progress)
> Jan 27 10:24:05 nfscc01 cib: [1318]: info: cib_process_diff: Diff 0.12147.1 -> 0.12147.2 not applied to 0.12145.4: current "epoch" is less than required
> Jan 27 10:24:05 nfscc01 cib: [1318]: info: cib_server_process_diff: Requesting re-sync from peer
> Jan 27 10:24:08 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update 1691 for probe_complete=true failed: Remote node did not respond
> Jan 27 10:24:31 nfscc01 crmd: [1322]: WARN: cib_rsc_callback: Resource update 315 failed: (rc=-41) Remote node did not respond

It could be that your CIB is very large and updated very often, which
brings the cib process to a crawl. You can also try to reduce the
frequency of the monitor operations, i.e. use larger monitor intervals.

Thanks,

Dejan

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
