Dear all,

while setting up a two-node cluster using Heartbeat, I ran into an annoying problem with the monitoring of DRBD. A little research on the net gave me two pointers, though neither contained an answer for me:
<http://www.mail-archive.com/[email protected]/msg08716.html>
  --> "[Linux-HA] More monitor failures near G_SIG_dispatch delays"
<http://lists.community.tummy.com/pipermail/linux-ha/2008-September/034489.html>
  --> "[Linux-HA] problem with drbd verify and heartbeat?"

First, some notes about the environment. Both nodes run nearly identical hardware with Novell SLES 10 SP2 installed:

- Kernel: 2.6.16.60-0.29-smp (SLES packages)
- Heartbeat: HA Version 2.1.4, CRM Version 2.0 (CIB feature set 2.0) (SLES packages)
- DRBD: Version: 0.7.22 (api:79) SVN Revision: 2572 (SLES packages)

The nodes run a classic failover setup with one partition shared via DRBD, holding mainly Squid caches and local BIND DNS zones. I have put the full configuration on my webspace (please let me know if that is not enough):

cib.xml   <http://www.basquiat.de/files/ha/cib.xml>
ha.cf     <http://www.basquiat.de/files/ha/ha.cf>
drbd.conf <http://www.basquiat.de/files/ha/drbd.conf>

Every few weeks the DRBD resource monitor times out and forces a failover. The most relevant log entries (please also let me know if you need more):

-------------------------------------------------------------------------
This is host "jupiter2", master:

[...]
heartbeat[3404]: 2008/10/28_14:54:16 info: MSG stats: 0/0 ms age 1570438646 [pid3431/HBREAD]
heartbeat[3404]: 2008/10/28_14:54:16 info: cl_malloc stats: 0/0 0/0 [pid3431/HBREAD]
heartbeat[3404]: 2008/10/28_14:54:16 info: RealMalloc stats: 0 total malloc bytes. pid [3431/HBREAD]
heartbeat[3404]: 2008/10/28_14:54:16 info: Current arena value: 0
heartbeat[3404]: 2008/10/28_14:54:16 info: These are nothing to worry about.
[... now things get worse]
lrmd[3567]: 2008/10/28_15:00:59 WARN: resource_drbd0:1:monitor process (PID 10459) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[3567]: 2008/10/28_15:00:59 WARN: Managed resource_drbd0:1:monitor process 10459 killed by signal 15 [SIGTERM - Termination (ANSI)].
lrmd[3567]: 2008/10/28_15:01:01 WARN: operation monitor[18] on ocf::drbd::resource_drbd0:1 for client 3570, its parameters: CRM_meta_interval=[59000] CRM_meta_role=[Master ] CRM_meta_start_delay=[0] CRM_meta_op_target_rc=[8] CRM_meta_notify_start_resource=[resource_drbd0:1 ] drbd_resource=[r0] CRM_meta_notify_master_uname=[jupiter1 ] CRM_meta_id=[611be629-11ad-42ec-beaf-56de832e23e0] CRM_meta_master_max=[1] CRM_meta_notify_master_resource=[resource_drbd0:0 ] CRM_meta_stateful=[true] CRM_meta_timeout=[10000] crm_feature_set=[2.0] CRM_meta_disabl: pid [10459] timed out
lrmd[3567]: 2008/10/28_15:01:01 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 1990 ms (> 300 ms) (GSource: 0x805f1b8)
crmd[3570]: 2008/10/28_15:01:01 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 50 ms
crmd[3570]: 2008/10/28_15:01:02 ERROR: process_lrm_event: LRM operation resource_drbd0:1_monitor_59000 (18) Timed Out (timeout=10000ms)
cib[3566]: 2008/10/28_15:01:02 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 60 ms
crmd[3570]: 2008/10/28_15:01:05 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 30 ms
tengine[4073]: 2008/10/28_15:01:05 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 50 ms
tengine[4073]: 2008/10/28_15:01:05 info: process_graph_event: Action resource_drbd0:1_monitor_59000 (2:-2;14:10:8:07f5d212-4901-4da9-a00a-93e7c5ccc832) initiated by a different transitioner
tengine[4073]: 2008/10/28_15:01:05 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[4073]: 2008/10/28_15:01:05 WARN: update_failcount: Updating failcount for resource_drbd0:1 on 81640fe5-d06b-4dbf-9e15-c5b2dde1c1eb after failed monitor: rc=-2 (update=value++)
mgmtd[3571]: 2008/10/28_15:01:05 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 280 ms
tengine[4073]: 2008/10/28_15:01:05 WARN: notify_crmd: Delaying completion until all CIB updates complete
crmd[3570]: 2008/10/28_15:01:05 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[3570]: 2008/10/28_15:01:05 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
pengine[4074]: 2008/10/28_15:01:06 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 70 ms
pengine[4074]: 2008/10/28_15:01:06 info: determine_online_status: Node jupiter1 is online
pengine[4074]: 2008/10/28_15:01:06 info: determine_online_status: Node jupiter2 is online
pengine[4074]: 2008/10/28_15:01:06 WARN: unpack_rsc_op: Processing failed op resource_drbd0:1_monitor_59000 on jupiter2: Timed Out

This is host "jupiter1", slave:

[...]
heartbeat[3359]: 2008/10/28_14:54:32 info: MSG stats: 0/0 ms age 1143841156 [pid3525/HBREAD]
heartbeat[3359]: 2008/10/28_14:54:32 info: cl_malloc stats: 0/0 0/0 [pid3525/HBREAD]
heartbeat[3359]: 2008/10/28_14:54:32 info: RealMalloc stats: 0 total malloc bytes. pid [3525/HBREAD]
heartbeat[3359]: 2008/10/28_14:54:32 info: Current arena value: 0
heartbeat[3359]: 2008/10/28_14:54:32 info: These are nothing to worry about.
[... now things get worse]
crmd[3533]: 2008/10/28_15:01:07 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=60:3:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:07 info: rsc:resource_drbd0:0: notify
lrmd[3530]: 2008/10/28_15:01:07 info: Managed resource_drbd0:0:notify process 22645 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:07 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=13, rc=0) complete
crmd[3533]: 2008/10/28_15:01:17 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=61:3:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:17 info: rsc:resource_drbd0:0: notify
crm_master[22704]: 2008/10/28_15:01:17 info: Invoked: /usr/sbin/crm_master -l reboot -v 75
crm_master[22704]: 2008/10/28_15:01:17 info: main: Detected uname: jupiter1
crm_master[22704]: 2008/10/28_15:01:17 info: main: Mapped jupiter1 to 176c524b-d312-4d07-af55-22801f6e00c7
lrmd[3530]: 2008/10/28_15:01:18 info: RA output: (resource_drbd0:0:notify:stdout) No set matching id=master-176c524b-d312-4d07-af55-22801f6e00c7 in status
lrmd[3530]: 2008/10/28_15:01:18 info: Managed resource_drbd0:0:notify process 22657 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:18 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=14, rc=0) complete
crmd[3533]: 2008/10/28_15:01:27 info: send_direct_ack: ACK'ing resource op resource_drbd0:0_monitor_60000 from 1:4:0:27105950-99a6-457e-9d57-9c9861655a98: lrm_invoke-lrmd-1225202487-4
crmd[3533]: 2008/10/28_15:01:27 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=58:4:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:27 info: rsc:resource_drbd0:0: notify
crmd[3533]: 2008/10/28_15:01:27 info: process_lrm_event: LRM operation resource_drbd0:0_monitor_60000 (call=12, rc=-2) Cancelled
lrmd[3530]: 2008/10/28_15:01:27 info: Managed resource_drbd0:0:notify process 22751 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:27 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=15, rc=0) complete
crmd[3533]: 2008/10/28_15:01:27 info: do_lrm_rsc_op: Performing op=resource_MailAdmin_start_0 key=47:4:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:27 info: rsc:resource_MailAdmin: start
lrmd[3530]: 2008/10/28_15:01:27 info: Managed resource_MailAdmin:start process 22763 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:27 info: process_lrm_event: LRM operation resource_MailAdmin_start_0 (call=16, rc=0) complete
crmd[3533]: 2008/10/28_15:01:28 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=59:4:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:28 info: rsc:resource_drbd0:0: notify
[... question: is the following failing 'update diff' harmless?]
cib[3529]: 2008/10/28_15:01:28 info: apply_xml_diff: Digest mis-match: expected 7f4e9667b7921803ff70a5f51421dd8a, calculated ebf24784543879a1fc1952474a2aaca4
cib[3529]: 2008/10/28_15:01:28 info: cib_process_diff: Diff 0.519.44 -> 0.519.45 not applied to 0.519.44: Failed application of a global update. Requesting full refresh.
cib[3529]: 2008/10/28_15:01:28 info: cib_process_diff: Requesting re-sync from peer: Failed application of a global update. Requesting full refresh.
cib[3529]: 2008/10/28_15:01:28 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_diff: Not applying diff 0.519.45 -> 0.519.46 (sync in progress)
cib[3529]: 2008/10/28_15:01:28 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_diff: Not applying diff 0.519.46 -> 0.519.47 (sync in progress)
cib[3529]: 2008/10/28_15:01:28 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
crm_master[22854]: 2008/10/28_15:01:28 info: Invoked: /usr/sbin/crm_master -l reboot -v 10
crm_master[22854]: 2008/10/28_15:01:28 info: main: Detected uname: jupiter1
crm_master[22854]: 2008/10/28_15:01:28 info: main: Mapped jupiter1 to 176c524b-d312-4d07-af55-22801f6e00c7
cib[3529]: 2008/10/28_15:01:29 info: cib_replace_notify: Replaced: 0.519.44 -> 0.519.47 from <null>
crmd[3533]: 2008/10/28_15:01:29 info: populate_cib_nodes: Requesting the list of configured nodes
lrmd[3530]: 2008/10/28_15:01:29 info: RA output: (resource_drbd0:0:notify:stdout) No set matching id=master-176c524b-d312-4d07-af55-22801f6e00c7 in status
lrmd[3530]: 2008/10/28_15:01:29 info: Managed resource_drbd0:0:notify process 22807 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:29 notice: populate_cib_nodes: Node: jupiter2 (uuid: 81640fe5-d06b-4dbf-9e15-c5b2dde1c1eb)
crmd[3533]: 2008/10/28_15:01:29 notice: populate_cib_nodes: Node: jupiter1 (uuid: 176c524b-d312-4d07-af55-22801f6e00c7)
crmd[3533]: 2008/10/28_15:01:29 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=17, rc=0) complete
crmd[3533]: 2008/10/28_15:01:31 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=54:5:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:31 info: rsc:resource_drbd0:0: notify
lrmd[3530]: 2008/10/28_15:01:31 info: Managed resource_drbd0:0:notify process 22857 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:31 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=18, rc=0) complete
crmd[3533]: 2008/10/28_15:01:32 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=55:5:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:32 info: rsc:resource_drbd0:0: notify
crm_master[22917]: 2008/10/28_15:01:32 info: Invoked: /usr/sbin/crm_master -l reboot -v 10
crm_master[22917]: 2008/10/28_15:01:32 info: main: Detected uname: jupiter1
crm_master[22917]: 2008/10/28_15:01:32 info: main: Mapped jupiter1 to 176c524b-d312-4d07-af55-22801f6e00c7
lrmd[3530]: 2008/10/28_15:01:34 info: RA output: (resource_drbd0:0:notify:stdout) No set matching id=master-176c524b-d312-4d07-af55-22801f6e00c7 in status
lrmd[3530]: 2008/10/28_15:01:34 info: Managed resource_drbd0:0:notify process 22869 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:34 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=19, rc=0) complete
crmd[3533]: 2008/10/28_15:01:35 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=58:6:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:35 info: rsc:resource_drbd0:0: notify
lrmd[3530]: 2008/10/28_15:01:35 info: Managed resource_drbd0:0:notify process 22920 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:35 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=20, rc=0) complete
crmd[3533]: 2008/10/28_15:01:36 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_promote_0 key=9:6:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:36 info: rsc:resource_drbd0:0: promote
lrmd[3530]: 2008/10/28_15:01:36 info: RA output: (resource_drbd0:0:promote:stdout) drbd[22932][23018]: 2008/10/28_15:01:36 INFO: r0 promote: primary succeeded
lrmd[3530]: 2008/10/28_15:01:36 info: Managed resource_drbd0:0:promote process 22932 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:36 info: process_lrm_event: LRM operation resource_drbd0:0_promote_0 (call=21, rc=0) complete
[...]
-------------------------------------------------------------------------

The syslog shows no special events during that time that could obviously be related to hard disk or I/O failures. Is there a known glitch that can be worked around? If I can help by providing more information, I will be glad to do so.

I also have the impression that DRBD is quite slow on SLES setups (compared to, e.g., Debian), but I haven't found the time yet to run benchmarks on identical hardware.

Best regards,
Jens

--
/* www: http://blog.basquiat.de | gpg: 0x0E4D824B
   icq: 22849810 | jid: [EMAIL PROTECTED] */

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
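P.S.: One workaround I am considering, in case it helps others hitting the same thing: since the logs show the monitor failing against timeout=10000ms while the interval is 59 s, I could simply raise the op timeout for the Master monitor in cib.xml. A sketch of what such an op definition might look like (the id and values here are placeholders, NOT copied from my actual configuration):

```xml
<!-- hypothetical monitor op for the DRBD master/slave resource;
     id is a placeholder; timeout raised from 10s to 60s so that
     occasional slow dispatch no longer kills the monitor -->
<op id="resource_drbd0-monitor-master" name="monitor"
    interval="59s" timeout="60s" role="Master"/>
```

After such a change, I would presumably also need to clear the accumulated failcount (something like crm_failcount with -D for the resource on the affected node), but I have not tried any of this yet, so corrections are welcome.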
