Dear all,

while setting up a two-node cluster with Heartbeat, I have run into an
annoying problem with the monitoring of the DRBD resource. A little
research on the net turned up two pointers, albeit without an answer
for my case:

<http://www.mail-archive.com/[email protected]/msg08716.html>
--> "[Linux-HA] More monitor failures near G_SIG_dispatch delays"

<http://lists.community.tummy.com/pipermail/linux-ha/2008-September/034489.html>
--> "[Linux-HA] problem with drbd verify and heartbeat?"

First, some notes about the environment: both nodes run nearly
identical hardware with Novell SLES 10 SP2 installed.

- Kernel is "2.6.16.60-0.29-smp"
  (SLES packages)

- Heartbeat is "HA Version 2.1.4, CRM Version 2.0 (CIB feature set 2.0)"
  (SLES packages)

- DRBD is "Version: 0.7.22 (api:79) SVN Revision: 2572"
  (SLES packages)

The nodes run a classic failover setup with one partition replicated
via DRBD, holding mainly Squid caches and local BIND DNS zones.

I put the full configuration on my webspace (please let me know if 
that's not enough):

cib.xml         <http://www.basquiat.de/files/ha/cib.xml>
ha.cf           <http://www.basquiat.de/files/ha/ha.cf>
drbd.conf       <http://www.basquiat.de/files/ha/drbd.conf>

Every few weeks, the monitor operation on the DRBD resource times out and forces a failover.
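
The Master monitor op runs with a timeout of only 10000 ms (see
CRM_meta_timeout in the log below). As a first workaround I am
considering simply raising that timeout. Just a sketch of what I have
in mind -- the op id "resource_drbd0_monitor_master" is a placeholder
for the real id in my cib.xml, and 30s is a guess:

  # raise the Master monitor timeout in the live CIB (untested):
  cibadmin -o resources -U -X '<op id="resource_drbd0_monitor_master"
    name="monitor" role="Master" interval="59s" timeout="30s"/>'

Though I suspect that would only paper over whatever stalls the lrmd
for more than ten seconds in the first place.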

The most relevant log entries (please also let me know if you need more):
-------------------------------------------------------------------------

This is host "jupiter2", master:
[...]
heartbeat[3404]: 2008/10/28_14:54:16 info: MSG stats: 0/0 ms age 1570438646 [pid3431/HBREAD]
heartbeat[3404]: 2008/10/28_14:54:16 info: cl_malloc stats: 0/0  0/0 [pid3431/HBREAD]
heartbeat[3404]: 2008/10/28_14:54:16 info: RealMalloc stats: 0 total malloc bytes. pid [3431/HBREAD]
heartbeat[3404]: 2008/10/28_14:54:16 info: Current arena value: 0
heartbeat[3404]: 2008/10/28_14:54:16 info: These are nothing to worry about.
[... now things get worse]
lrmd[3567]: 2008/10/28_15:00:59 WARN: resource_drbd0:1:monitor process (PID 10459) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd[3567]: 2008/10/28_15:00:59 WARN: Managed resource_drbd0:1:monitor process 10459 killed by signal 15 [SIGTERM - Termination (ANSI)].
lrmd[3567]: 2008/10/28_15:01:01 WARN: operation monitor[18] on ocf::drbd::resource_drbd0:1 for client 3570, its parameters: CRM_meta_interval=[59000] CRM_meta_role=[Master] CRM_meta_start_delay=[0] CRM_meta_op_target_rc=[8] CRM_meta_notify_start_resource=[resource_drbd0:1 ] drbd_resource=[r0] CRM_meta_notify_master_uname=[jupiter1 ] CRM_meta_id=[611be629-11ad-42ec-beaf-56de832e23e0] CRM_meta_master_max=[1] CRM_meta_notify_master_resource=[resource_drbd0:0 ] CRM_meta_stateful=[true] CRM_meta_timeout=[10000] crm_feature_set=[2.0] CRM_meta_disabl: pid [10459] timed out
lrmd[3567]: 2008/10/28_15:01:01 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 1990 ms (> 300 ms) (GSource: 0x805f1b8)
crmd[3570]: 2008/10/28_15:01:01 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 50 ms
crmd[3570]: 2008/10/28_15:01:02 ERROR: process_lrm_event: LRM operation resource_drbd0:1_monitor_59000 (18) Timed Out (timeout=10000ms)
cib[3566]: 2008/10/28_15:01:02 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 60 ms
crmd[3570]: 2008/10/28_15:01:05 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 30 ms
tengine[4073]: 2008/10/28_15:01:05 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 50 ms
tengine[4073]: 2008/10/28_15:01:05 info: process_graph_event: Action resource_drbd0:1_monitor_59000 (2:-2;14:10:8:07f5d212-4901-4da9-a00a-93e7c5ccc832) initiated by a different transitioner
tengine[4073]: 2008/10/28_15:01:05 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[4073]: 2008/10/28_15:01:05 WARN: update_failcount: Updating failcount for resource_drbd0:1 on 81640fe5-d06b-4dbf-9e15-c5b2dde1c1eb after failed monitor: rc=-2 (update=value++)
mgmtd[3571]: 2008/10/28_15:01:05 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 280 ms
tengine[4073]: 2008/10/28_15:01:05 WARN: notify_crmd: Delaying completion until all CIB updates complete
crmd[3570]: 2008/10/28_15:01:05 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[3570]: 2008/10/28_15:01:05 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
pengine[4074]: 2008/10/28_15:01:06 WARN: G_CH_dispatch_int: resume_io() for IPC channel took 70 ms
pengine[4074]: 2008/10/28_15:01:06 info: determine_online_status: Node jupiter1 is online
pengine[4074]: 2008/10/28_15:01:06 info: determine_online_status: Node jupiter2 is online
pengine[4074]: 2008/10/28_15:01:06 WARN: unpack_rsc_op: Processing failed op resource_drbd0:1_monitor_59000 on jupiter2: Timed Out

This is host "jupiter1", slave:
[...]
heartbeat[3359]: 2008/10/28_14:54:32 info: MSG stats: 0/0 ms age 1143841156 [pid3525/HBREAD]
heartbeat[3359]: 2008/10/28_14:54:32 info: cl_malloc stats: 0/0  0/0 [pid3525/HBREAD]
heartbeat[3359]: 2008/10/28_14:54:32 info: RealMalloc stats: 0 total malloc bytes. pid [3525/HBREAD]
heartbeat[3359]: 2008/10/28_14:54:32 info: Current arena value: 0
heartbeat[3359]: 2008/10/28_14:54:32 info: These are nothing to worry about.
[... now things get worse]
crmd[3533]: 2008/10/28_15:01:07 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=60:3:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:07 info: rsc:resource_drbd0:0: notify
lrmd[3530]: 2008/10/28_15:01:07 info: Managed resource_drbd0:0:notify process 22645 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:07 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=13, rc=0) complete
crmd[3533]: 2008/10/28_15:01:17 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=61:3:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:17 info: rsc:resource_drbd0:0: notify
crm_master[22704]: 2008/10/28_15:01:17 info: Invoked: /usr/sbin/crm_master -l reboot -v 75
crm_master[22704]: 2008/10/28_15:01:17 info: main: Detected uname: jupiter1
crm_master[22704]: 2008/10/28_15:01:17 info: main: Mapped jupiter1 to 176c524b-d312-4d07-af55-22801f6e00c7
lrmd[3530]: 2008/10/28_15:01:18 info: RA output: (resource_drbd0:0:notify:stdout) No set matching id=master-176c524b-d312-4d07-af55-22801f6e00c7 in status
lrmd[3530]: 2008/10/28_15:01:18 info: Managed resource_drbd0:0:notify process 22657 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:18 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=14, rc=0) complete
crmd[3533]: 2008/10/28_15:01:27 info: send_direct_ack: ACK'ing resource op resource_drbd0:0_monitor_60000 from 1:4:0:27105950-99a6-457e-9d57-9c9861655a98: lrm_invoke-lrmd-1225202487-4
crmd[3533]: 2008/10/28_15:01:27 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=58:4:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:27 info: rsc:resource_drbd0:0: notify
crmd[3533]: 2008/10/28_15:01:27 info: process_lrm_event: LRM operation resource_drbd0:0_monitor_60000 (call=12, rc=-2) Cancelled
lrmd[3530]: 2008/10/28_15:01:27 info: Managed resource_drbd0:0:notify process 22751 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:27 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=15, rc=0) complete
crmd[3533]: 2008/10/28_15:01:27 info: do_lrm_rsc_op: Performing op=resource_MailAdmin_start_0 key=47:4:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:27 info: rsc:resource_MailAdmin: start
lrmd[3530]: 2008/10/28_15:01:27 info: Managed resource_MailAdmin:start process 22763 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:27 info: process_lrm_event: LRM operation resource_MailAdmin_start_0 (call=16, rc=0) complete
crmd[3533]: 2008/10/28_15:01:28 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=59:4:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:28 info: rsc:resource_drbd0:0: notify
[... question: is the following 'update diff' failure harmless?]
cib[3529]: 2008/10/28_15:01:28 info: apply_xml_diff: Digest mis-match: expected 7f4e9667b7921803ff70a5f51421dd8a, calculated ebf24784543879a1fc1952474a2aaca4
cib[3529]: 2008/10/28_15:01:28 info: cib_process_diff: Diff 0.519.44 -> 0.519.45 not applied to 0.519.44: Failed application of a global update.  Requesting full refresh.
cib[3529]: 2008/10/28_15:01:28 info: cib_process_diff: Requesting re-sync from peer: Failed application of a global update.  Requesting full refresh.
cib[3529]: 2008/10/28_15:01:28 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_diff: Not applying diff 0.519.45 -> 0.519.46 (sync in progress)
cib[3529]: 2008/10/28_15:01:28 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_diff: Not applying diff 0.519.46 -> 0.519.47 (sync in progress)
cib[3529]: 2008/10/28_15:01:28 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[3529]: 2008/10/28_15:01:28 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed, requesting a full refresh
crm_master[22854]: 2008/10/28_15:01:28 info: Invoked: /usr/sbin/crm_master -l reboot -v 10
crm_master[22854]: 2008/10/28_15:01:28 info: main: Detected uname: jupiter1
crm_master[22854]: 2008/10/28_15:01:28 info: main: Mapped jupiter1 to 176c524b-d312-4d07-af55-22801f6e00c7
cib[3529]: 2008/10/28_15:01:29 info: cib_replace_notify: Replaced: 0.519.44 -> 0.519.47 from <null>
crmd[3533]: 2008/10/28_15:01:29 info: populate_cib_nodes: Requesting the list of configured nodes
lrmd[3530]: 2008/10/28_15:01:29 info: RA output: (resource_drbd0:0:notify:stdout) No set matching id=master-176c524b-d312-4d07-af55-22801f6e00c7 in status
lrmd[3530]: 2008/10/28_15:01:29 info: Managed resource_drbd0:0:notify process 22807 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:29 notice: populate_cib_nodes: Node: jupiter2 (uuid: 81640fe5-d06b-4dbf-9e15-c5b2dde1c1eb)
crmd[3533]: 2008/10/28_15:01:29 notice: populate_cib_nodes: Node: jupiter1 (uuid: 176c524b-d312-4d07-af55-22801f6e00c7)
crmd[3533]: 2008/10/28_15:01:29 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=17, rc=0) complete
crmd[3533]: 2008/10/28_15:01:31 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=54:5:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:31 info: rsc:resource_drbd0:0: notify
lrmd[3530]: 2008/10/28_15:01:31 info: Managed resource_drbd0:0:notify process 22857 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:31 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=18, rc=0) complete
crmd[3533]: 2008/10/28_15:01:32 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=55:5:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:32 info: rsc:resource_drbd0:0: notify
crm_master[22917]: 2008/10/28_15:01:32 info: Invoked: /usr/sbin/crm_master -l reboot -v 10
crm_master[22917]: 2008/10/28_15:01:32 info: main: Detected uname: jupiter1
crm_master[22917]: 2008/10/28_15:01:32 info: main: Mapped jupiter1 to 176c524b-d312-4d07-af55-22801f6e00c7
lrmd[3530]: 2008/10/28_15:01:34 info: RA output: (resource_drbd0:0:notify:stdout) No set matching id=master-176c524b-d312-4d07-af55-22801f6e00c7 in status
lrmd[3530]: 2008/10/28_15:01:34 info: Managed resource_drbd0:0:notify process 22869 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:34 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=19, rc=0) complete
crmd[3533]: 2008/10/28_15:01:35 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_notify_0 key=58:6:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:35 info: rsc:resource_drbd0:0: notify
lrmd[3530]: 2008/10/28_15:01:35 info: Managed resource_drbd0:0:notify process 22920 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:35 info: process_lrm_event: LRM operation resource_drbd0:0_notify_0 (call=20, rc=0) complete
crmd[3533]: 2008/10/28_15:01:36 info: do_lrm_rsc_op: Performing op=resource_drbd0:0_promote_0 key=9:6:0:27105950-99a6-457e-9d57-9c9861655a98)
lrmd[3530]: 2008/10/28_15:01:36 info: rsc:resource_drbd0:0: promote
lrmd[3530]: 2008/10/28_15:01:36 info: RA output: (resource_drbd0:0:promote:stdout)
drbd[22932][23018]: 2008/10/28_15:01:36 INFO: r0 promote: primary succeeded
lrmd[3530]: 2008/10/28_15:01:36 info: Managed resource_drbd0:0:promote process 22932 exited with return code 0.
crmd[3533]: 2008/10/28_15:01:36 info: process_lrm_event: LRM operation resource_drbd0:0_promote_0 (call=21, rc=0) complete
[...]

The syslog shows no events during that time that could obviously be
related to hard disk or I/O failures.
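
One thing I am unsure about: after such a timeout the failcount for
resource_drbd0:1 on jupiter2 stays incremented (see update_failcount
above). I assume it can be reset once things have settled with
something like this -- a sketch from my reading of the man pages, not
verified here yet:

  # clean up the failed monitor op and reset the failcount
  # on the old master:
  crm_resource -C -r resource_drbd0 -H jupiter2
  crm_failcount -D -U jupiter2 -r resource_drbd0:1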

Is there a known glitch that can be worked around? If I can help by
providing more info, I will be glad to do so. I also have the
impression that DRBD is quite slow on SLES setups (compared, e.g., to
Debian), but I have not yet found the time to run benchmarks on
identical hardware.
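
If it helps, I could time the RA's monitor action by hand, both during
normal operation and during a resync/verify run -- a sketch, assuming
the usual OCF paths and variables on SLES:

  # run the drbd OCF agent's monitor action outside the cluster:
  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_drbd_resource=r0
  time $OCF_ROOT/resource.d/heartbeat/drbd monitor; echo "rc=$?"
  # rc=0 = running as secondary, rc=8 = running as primary
  # (cf. CRM_meta_op_target_rc=[8] in the log above)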

Best regards,
Jens

-- 
/* www: http://blog.basquiat.de    | gpg: 0x0E4D824B
   icq: 22849810                   | jid: [EMAIL PROTECTED] */
