On Mon, Jan 30, 2012 at 9:28 PM, Sascha Reimann <[email protected]> wrote:
> Hi Dejan,
>
> thanks for the hints!
>
> It's indeed a bigger cluster with currently 9 nodes and approximately 50
> resources, but we've planned to build an even bigger one, which will
> probably not be possible due to the frequent updates. The timeout for
> start and stop is set to 300s; monitor operations are configured for
> only a few resources, also with a timeout of 300s.
>
> This is the current CIB with 30 resources (I've deleted the 50
> productive ones for testing):
>
> :~$ cibadmin -Q | wc
>    1686    7802  193120
I've seen worse.

> I'm using location rules with variable scoring (for example, rule
> free_ram: free_ram gte 4096 and free_cores gte 2) to determine the best
> location for a virtual machine. Could that condition also cause updates
> of the CIB? Each node is writing its free resources to the CIB, which is
> definitely a source of updates, and I will try to reduce them.

Right, this is the most likely cause.

> ocf-tester was complaining about meta-data and permissions, which I've
> corrected.
> "crm configure verify" looks good, apart from a couple of complaints
> about meta-data attributes I've set manually.
>
> Can you explain "client gone before receiving answer" a bit more?
>
> thanks!
> Sascha
>
> On 01/27/2012 08:05 PM, Dejan Muhamedagic wrote:
>> Hi,
>>
>> On Fri, Jan 27, 2012 at 10:49:52AM +0100, Sascha Reimann wrote:
>>> hello,
>>>
>>> I have a problem with timeouts from my RA for virtual Xen machines.
>>> Since xm commands are very slow, I have replaced them with some xentop
>>> and pgrep commands to avoid timeouts from xm. Unfortunately, this
>>> didn't solve the problem, as corosync/pacemaker gets a timeout
>>> requesting the status of a resource and ends up with "Remote node did
>>> not respond" messages.
>>
>> That has nothing to do with the RA performance.
>>
>>> Any idea how to increase the performance of communication between the
>>> nodes?
>>
>> Reduce the configuration, if possible. Split one big cluster into
>> two smaller ones, if possible. How many resources do you have?
>> How big does your CIB get (cibadmin -Q | wc)?
>>
>>> This is an example config of a resource:
>>>
>>> primitive VM_24_435 ocf:hostway:Xen \
>>>     params xmfile="/etc/xen/conf.d/VM_24_435.cfg" \
>>>     op start interval="0" timeout="60s" on-fail="restart" \
>>>     op stop interval="0" timeout="40s" \
>>>     meta target-role="Started" allow-migrate="true"
>>> [...]
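Since each node writing its free_ram/free_cores values straight into the CIB is the likely source of the update storm, one way to lower the write rate is attrd's dampening delay, which coalesces attribute changes before they hit the CIB. This is only a sketch: the attribute names come from the rule quoted above, while the 300s delay and the commands used to measure the values are illustrative, not taken from the thread.

```shell
#!/bin/sh
# Sketch: write per-node utilization attributes with dampening instead of
# updating the CIB on every change. How the values are measured here is
# purely illustrative.
free_ram=$(free -m | awk '/^Mem:/ {print $4}')
free_cores=$(grep -c ^processor /proc/cpuinfo)

# -d makes attrd wait and batch further changes before flushing to the CIB;
# pick a delay that still satisfies the placement rules' freshness needs.
attrd_updater -n free_ram   -U "$free_ram"   -d 300s
attrd_updater -n free_cores -U "$free_cores" -d 300s
```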
>>>
>>> ...and the basic configuration:
>>>
>>> no-quorum-policy="ignore" \
>>> stonith-enabled="false" \
>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>> cluster-infrastructure="openais" \
>>> expected-quorum-votes="9" \
>>> last-lrm-refresh="1327655374" \
>>> symmetric-cluster="true" \
>>> default-resource-stickiness="INFINITY" \
>>> maintenance-mode="false"
>>>
>>> corosync/pacemaker is installed from debian backports:
>>>
>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>> 9 Nodes configured, 9 expected votes
>>>
>>> When I start a cleanup (crm resource cleanup VM_24_435) while the cib
>>> process is already busy, this is what crm_mon tells me:
>>>
>>> Failed actions:
>>>     VM_24_435_monitor_0 (node=vnode04, call=-1, rc=1, status=Timed
>>> Out): unknown error
>>>     VM_24_435_monitor_0 (node=vnode06, call=-1, rc=1, status=Timed
>>> Out): unknown error
>>
>> This looks like a real monitor timeout, but it's not easy to tell
>> without looking at the configuration and logs. Are the timeouts
>> sufficiently big? Did you run crm configure verify?
>>
>>> The monitor request itself is quick:
>>>
>>> :~$ time bash -x /usr/lib/ocf/resource.d/hostway/Xen monitor
>>> + . /usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes
>>> ++ OCF_SUCCESS=0
>>> ++ OCF_ERR_GENERIC=1
>>> ++ OCF_ERR_ARGS=2
>>> ++ OCF_ERR_UNIMPLEMENTED=3
>>> ++ OCF_ERR_PERM=4
>>> ++ OCF_ERR_INSTALLED=5
>>> ++ OCF_ERR_CONFIGURED=6
>>> ++ OCF_NOT_RUNNING=7
>>> ++ OCF_RUNNING_MASTER=8
>>> ++ OCF_FAILED_MASTER=9
>>> + . /root/hep/scripts/funcs
>>> +++ basename /usr/lib/ocf/resource.d/hostway/Xen
>>> ++ prog=Xen
>>> ++ cib=/tmp/cib
>>> + : /etc/xen/vm/MyDomU
>>> + DOMAIN_NAME=VM_24_435
>>> + '[' 1 -ne 1 ']'
>>> + case $1 in
>>> + '[' x '!=' x ']'
>>> + '[' -f /etc/xen/vm/MyDomU ']'
>>> + DOMAIN_NAME=VM_24_435
>>> + case $1 in
>>> + Xen_Monitor
>>> + Xen_Status VM_24_435
>>> + /usr/bin/pgrep -lf 'qemu-dmbin.*domain-name VM_24_435'
>>> + '[' 0 = 0 ']'
>>> + return 0
>>> + [[ 0 -eq 7 ]]
>>> + '[' X = X ']'
>>> + return 0
>>> + exit 0
>>>
>>> real    0m0.122s
>>>
>>> My corosync config was the debian default; I modified max_messages
>>> (increased from 20 to 500) and tried transport udpu and secauth with 8
>>> threads on an 8-core machine, but the cib process still gets up to
>>> 100% cpu usage on a 1Gig network.
>>
>> The cib process is mostly processing the CIB; the kind of
>> transport won't have much impact. Such a high load is probably
>> coming from too frequent CIB updates.
>>
>>> Here's the logfile:
>>>
>>> Jan 27 09:49:16 vnode10 crmd: [27990]: ERROR: send_msg_via_ipc: Unknown
>>> Sub-system (9593_crm_resource)... discarding message.
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: ERROR: send_msg_via_ipc: Unknown
>>> Sub-system (9593_crm_resource)... discarding message.
>>
>> This is probably "client gone before receiving answer".
>>
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
>>> Shutdown escalation occurs after: 1200000ms
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
>>> Checking for expired actions every 900000ms
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
>>> Sending expected-votes=9 to corosync
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
>>> Shutdown escalation occurs after: 1200000ms
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
>>> Checking for expired actions every 900000ms
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: config_query_callback:
>>> Sending expected-votes=9 to corosync
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: ais_dispatch_message:
>>> Membership 6204: quorum retained
>>> Jan 27 09:49:18 vnode10 crmd: [27990]: info: ais_dispatch_message:
>>> Membership 6204: quorum retained
>>> Jan 27 09:49:56 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing
>>> key=17:117:7:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_monitor_0 )
>>> Jan 27 09:49:56 vnode10 lrmd: [27987]: info: rsc:VM_24_435:48: probe
>>> Jan 27 09:49:57 vnode10 crmd: [27990]: info: process_lrm_event: LRM
>>> operation VM_24_435_monitor_0 (call=48, rc=0, cib-update=260,
>>> confirmed=true) ok
>>> Jan 27 09:49:58 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing
>>> key=75:118:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_stop_0 )
>>> Jan 27 09:49:58 vnode10 lrmd: [27987]: info: rsc:VM_24_435:49: stop
>>> Jan 27 09:50:05 vnode10 crmd: [27990]: info: process_lrm_event: LRM
>>> operation VM_24_435_stop_0 (call=49, rc=0, cib-update=261,
>>> confirmed=true) ok
>>> Jan 27 09:50:05 vnode10 crmd: [27990]: info: do_lrm_rsc_op: Performing
>>> key=76:118:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_start_0 )
>>> Jan 27 09:50:05 vnode10 lrmd: [27987]: info: rsc:VM_24_435:50: start
>>> Jan 27 09:50:13 vnode10 lrmd: [27987]: WARN: on_msg_get_metadata: empty
>>> metadata for ocf::hostway::Xen.
>>> Jan 27 09:50:13 vnode10 lrmd: [27987]: WARN: G_SIG_dispatch: Dispatch
>>> function for SIGCHLD was delayed 200 ms (> 100 ms) before being called
>>> (GSource: 0x2412430)
>>
>> Host too busy? Did you check uptime and performance statistics?
>>
>>> Jan 27 09:50:13 vnode10 lrmd: [27987]: info: G_SIG_dispatch: started at
>>> 1838690515 should have started at 1838690495
>>> Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR:
>>> lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
>>> message of rmetadata with function get_ret_from_msg.
>>> Jan 27 09:50:13 vnode10 crmd: [27990]: WARN: get_rsc_metadata: No
>>> metadata found for Xen::ocf:hostway
>>> Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: string2xml: Can't parse
>>> NULL input
>>> Jan 27 09:50:13 vnode10 crmd: [27990]: ERROR: get_rsc_restart_list:
>>> Metadata for hostway::ocf:Xen is not valid XML
>>
>> Your OCF RA is not OCF compliant. Fix it and use ocf-tester to
>> make sure that all required functionality is correct.
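The "empty metadata" and "not valid XML" errors above are what lrmd/crmd report when an RA's meta-data action prints nothing or broken XML. For reference, a minimal sketch of what that action needs to emit; the parameter, descriptions, and timeouts here are illustrative, not the actual hostway agent:

```shell
#!/bin/sh
# Sketch of a minimal OCF meta-data action. All names/values below are
# illustrative except "xmfile", which appears in the thread's resource config.
meta_data() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="Xen" version="0.1">
<version>1.0</version>
<longdesc lang="en">Manages a Xen domU (status checked via xentop/pgrep).</longdesc>
<shortdesc lang="en">Xen domain resource agent</shortdesc>
<parameters>
<parameter name="xmfile" unique="1" required="1">
<longdesc lang="en">Path to the Xen domain configuration file.</longdesc>
<shortdesc lang="en">xm config file</shortdesc>
<content type="string"/>
</parameter>
</parameters>
<actions>
<action name="start"     timeout="60s"/>
<action name="stop"      timeout="40s"/>
<action name="monitor"   interval="10s" timeout="30s"/>
<action name="meta-data" timeout="5s"/>
</actions>
</resource-agent>
END
}

# The RA must answer the meta-data action with the XML above and exit 0.
case "$1" in
    meta-data) meta_data; exit 0 ;;
esac
```

Running the script through `ocf-tester` (or just `xmllint --noout` on the output) confirms the XML parses, which is exactly what crmd's `string2xml` was failing on.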
>>
>>> Jan 27 09:50:13 vnode10 crmd: [27990]: info: process_lrm_event: LRM
>>> operation VM_24_435_start_0 (call=50, rc=0, cib-update=262,
>>> confirmed=true) ok
>>>
>>> Eventually, the cib becomes unresponsive:
>>>
>>> :~$ crm resource manage VM_24_435
>>> Call cib_replace failed (-41): Remote node did not respond
>>> <null>
>>>
>>> Jan 27 10:22:14 nfscc01 cib: [1318]: info: cib_process_diff: Diff
>>> 0.12146.136 -> 0.12146.137 not applied to 0.12145.4: current "epoch" is
>>> less than required
>>> Jan 27 10:22:14 nfscc01 cib: [1318]: info: cib_server_process_diff:
>>> Requesting re-sync from peer
>>> Jan 27 10:22:14 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.137 -> 0.12146.138 (sync in progress)
>>> Jan 27 10:22:14 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.138 -> 0.12146.139 (sync in progress)
>>> Jan 27 10:23:10 nfscc01 crmd: [1322]: WARN: cib_rsc_callback: Resource
>>> update 314 failed: (rc=-41) Remote node did not respond
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.139 -> 0.12146.140 (sync in progress)
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.140 -> 0.12146.141 (sync in progress)
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.141 -> 0.12146.142 (sync in progress)
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: info: cib_process_diff: Diff
>>> 0.12146.142 -> 0.12146.143 not applied to 0.12145.4: current "epoch" is
>>> less than required
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: info: cib_server_process_diff:
>>> Requesting re-sync from peer
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.143 -> 0.12146.144 (sync in progress)
>>> Jan 27 10:23:30 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.144 -> 0.12146.145 (sync in progress)
>>> Jan 27 10:23:31 nfscc01 crmd: [1322]: info: do_lrm_rsc_op: Performing
>>> key=3:161:0:4ac4f5c1-f61a-48f6-bef8-6b2223c973f2 op=VM_24_435_stop_0 )
>>> Jan 27 10:23:31 nfscc01 lrmd: [1319]: info: rsc:VM_24_435:65: stop
>>> Jan 27 10:23:31 nfscc01 crmd: [1322]: info: process_lrm_event: LRM
>>> operation VM_24_435_stop_0 (call=65, rc=0, cib-update=315,
>>> confirmed=true) ok
>>> Jan 27 10:23:41 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1594 for last-failure-VM_24_435=1327655960 failed: Remote node did not
>>> respond
>>> Jan 27 10:23:41 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1596 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1603 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1611 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1616 for last-failure-VM_24_435=1327655960 failed: Remote node did not
>>> respond
>>> Jan 27 10:23:42 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1621 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1626 for last-failure-VM_24_435=1327655960 failed: Remote node did not
>>> respond
>>> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1631 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1633 for last-failure-VM_24_435=1327655960 failed: Remote node did not
>>> respond
>>> Jan 27 10:23:43 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1635 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1642 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1649 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1656 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1661 for last-failure-VM_24_435=1327655960 failed: Remote node did not
>>> respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1666 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1671 for last-failure-VM_24_435=1327655960 failed: Remote node did not
>>> respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1676 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:23:44 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1684 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:24:04 nfscc01 attrd: [1320]: notice: attrd_ais_dispatch:
>>> Update relayed from vnode06
>>> Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.145 -> 0.12146.146 (sync in progress)
>>> Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.145 -> 0.12146.146 (sync in progress)
>>> Jan 27 10:24:04 nfscc01 cib: [1318]: notice: cib_server_process_diff:
>>> Not applying diff 0.12146.146 -> 0.12147.1 (sync in progress)
>>> Jan 27 10:24:05 nfscc01 cib: [1318]: info: cib_process_diff: Diff
>>> 0.12147.1 -> 0.12147.2 not applied to 0.12145.4: current "epoch" is
>>> less than required
>>> Jan 27 10:24:05 nfscc01 cib: [1318]: info: cib_server_process_diff:
>>> Requesting re-sync from peer
>>> Jan 27 10:24:08 nfscc01 attrd: [1320]: WARN: attrd_cib_callback: Update
>>> 1691 for probe_complete=true failed: Remote node did not respond
>>> Jan 27 10:24:31 nfscc01 crmd: [1322]: WARN: cib_rsc_callback: Resource
>>> update 315 failed: (rc=-41) Remote node did not respond
>>
>> It could be that your CIB is very large and very often updated,
>> which brings the cib to a crawl. You can also try to reduce the
>> interval of monitoring operations.
>>
>> Thanks,
>>
>> Dejan
>>
>>> _______________________________________________
>>> Linux-HA mailing list
>>> [email protected]
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>> See also: http://linux-ha.org/ReportingProblems
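Dejan's suggestion about monitoring operations could be put into practice with the crm shell, redefining the resource so its monitor runs less often. A sketch only: it assumes crmsh's `configure load update` (which merges the new definition over the existing one), and the 120s interval and temp-file path are illustrative.

```shell
#!/bin/sh
# Sketch: redefine VM_24_435 with an explicit, fairly infrequent monitor op.
# Interval/timeout values and the temp-file path are illustrative.
cat > /tmp/vm_24_435.crm <<'EOF'
primitive VM_24_435 ocf:hostway:Xen \
    params xmfile="/etc/xen/conf.d/VM_24_435.cfg" \
    op start interval="0" timeout="60s" on-fail="restart" \
    op stop interval="0" timeout="40s" \
    op monitor interval="120s" timeout="300s" \
    meta target-role="Started" allow-migrate="true"
EOF
# "load update" merges this definition into the running configuration.
crm configure load update /tmp/vm_24_435.crm
```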
