Hi,

On Mon, Jan 05, 2009 at 11:29:41AM +0200, Zakh, Rami wrote:
> I have been testing usage of HA2.x as a redundancy management
> solution by running a few simple 2-node clusters with HA 2.1.4
> on RHES4update4 .

Is RHES a Red Hat product? I think that people run heartbeat on
RHEL4 too.

> By simple i mean one resource group running
> an OCF tomcat, an LSB application and a floating IP.
>
> On most of the systems i have severe stability issues where one
> of the node is virtually killed approx. once a day (STONITH is
> disabled).
>
> Looking through the lists i have a feeling these are issues
> which have already been dealt on other platforms and are
> supposed to be resolved. I also tackled a few issues with the
> hb_gui which were related to the fact that standard RHES4
> distributions do not contain the proper python and python-gtk
> versions to run a full-fledged hb_gui (which i worked around by
> using command lines where the gui failed to perform).
>
> Maybe i am wrong, but all this leads me to think that HA2.1.4
> and RHES4u4 do not walk nicely hand in hand or that i am
> getting something totally wrong, although i do have one stable
> cluster with "similar" configuration.
>
> I'd appreciate your feedback on this.

On to the log:

> Jan 5 01:21:16 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)

This is very serious. The local status is being delayed. A
kernel/scheduler problem? Hardware issues? Which kernel do you
run? Anything in system logs? The heartbeat process is locked in
memory and runs at the highest priority, so it shouldn't depend
much on the system load.

> Jan 5 01:29:52 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:29:59 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded

According to the archives, there has been only one occurrence of
such messages reported. Unfortunately, the user didn't provide
details. For whatever reason, the IPC layer is not able to send
messages. I guess that this is a consequence of the same problem.
Does this happen at the same time of day, i.e. could it be due to
some hardware-intensive processing such as a backup? Is your
kernel up to date?

Thanks,

Dejan

> Thanks in advance,
> Rami.
>
> p.s.1 ha.cf for such a problematic system:
>
> crm on
> ucast eth0 10.36.22.173
> auto_failback off
> node fox4
> node fox6
> use_logd yes
>
> p.s.2 a syslog bit of a failure which rendered a node (namely fox6) dead:
>
> Jan 5 01:21:01 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:21:01 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432920022 should have started at 432919751
> Jan 5 01:21:05 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2700 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:21:05 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432920392 should have started at 432920122
> Jan 5 01:21:05 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:21:05 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432920393 should have started at 432920122
> Jan 5 01:21:09 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:21:09 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432920763 should have started at 432920492
> Jan 5 01:21:09 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:21:13 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432920764 should have started at 432920493
> Jan 5 01:21:16 fox6 heartbeat: [3241]: WARN:
> Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:21:20 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432921134 should have started at 432920863
> Jan 5 01:21:27 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2700 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:21:31 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432921134 should have started at 432920864
> Jan 5 01:21:35 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Jan 5 01:21:39 fox6 cib: [3929]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Jan 5 01:21:50 fox6 crmd: [3933]: info: mem_handle_event: no mbr_track info
> Jan 5 01:21:53 fox6 cib: [3929]: info: mem_handle_event: no mbr_track info
> Jan 5 01:22:04 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Jan 5 01:22:12 fox6 cib: [3929]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Jan 5 01:22:19 fox6 crmd: [3933]: info: mem_handle_event: instance=41,
> nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:22:30 fox6 cib: [3929]: info: mem_handle_event: instance=41,
> nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:22:41 fox6 crmd: [3933]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=41)
> Jan 5 01:22:49 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox4
> Jan 5 01:22:56 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox6
> Jan 5 01:23:00 fox6 crmd: [3933]: info: ccm_event_detail: NEW MEMBERSHIP:
> trans=41, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:23:04 fox6 crmd: [3933]: info: ccm_event_detail: CURRENT: fox4
> [nodeid=0, born=1]
> Jan 5 01:23:11 fox6 crmd: [3933]: info: ccm_event_detail: CURRENT: fox6
> [nodeid=1, born=41]
> Jan 5 01:23:15 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:23:22 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432921505 should have started at 432921234
> Jan 5 01:23:26 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:23:33 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432921505 should have started at 432921234
> Jan 5 01:23:41 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2700 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:23:45 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432921875 should have started at 432921605
> Jan 5 01:23:48 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:23:56 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432921876 should have started at 432921605
> Jan 5 01:23:59 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:24:03 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432922246 should have started at 432921975
> Jan 5 01:24:07 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Jan 5 01:24:10 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2700 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:24:18 fox6 crmd: [3933]: info: mem_handle_event: no mbr_track info
> Jan 5 01:24:22 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432922246 should have started at 432921976
> Jan 5 01:24:29 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Jan 5 01:24:33 fox6 crmd: [3933]: info: mem_handle_event: instance=44,
> nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:24:40 fox6 crmd: [3933]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=44)
> Jan 5 01:24:48 fox6 crmd: [3933]: info: ccm_event_detail: NEW MEMBERSHIP:
> trans=44, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:24:55 fox6 crmd: [3933]: info: ccm_event_detail: CURRENT: fox4
> [nodeid=0, born=1]
> Jan 5 01:24:59 fox6 crmd: [3933]: info: ccm_event_detail: CURRENT: fox6
> [nodeid=1, born=44]
> Jan 5 01:25:06 fox6 cib: [3929]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Jan 5 01:25:10 fox6 cib: [3929]: info: mem_handle_event: no mbr_track info
> Jan 5 01:25:13 fox6 cib: [3929]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Jan 5 01:25:21 fox6 cib: [3929]: info: mem_handle_event: instance=44,
> nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:25:25 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox4
> Jan 5 01:25:32 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox6
> Jan 5 01:25:39 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:25:43 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432922617 should have started at 432922346
> Jan 5 01:25:47 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:25:51 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:25:54 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432922617 should have started at 432922346
> Jan 5 01:25:58 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:26:02 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2700 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:26:05 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:26:13 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432922987 should have started at 432922717
> Jan 5 01:26:16 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:26:24 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:26:31 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:26:39 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432922988 should have started at 432922717
> Jan 5 01:26:42 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:26:50 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 2710 ms (> 510 ms) before
> being called (GSource: 0x9113798)
> Jan 5 01:26:57 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:27:01 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Jan 5 01:27:08 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:27:27 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:27:20 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:27:27 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:27:12 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432923358 should have started at 432923087
> Jan 5 01:27:31 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:27:38 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:27:45 fox6 crmd: [3933]: info: mem_handle_event: no mbr_track info
> Jan 5 01:27:53 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:27:57 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:28:04 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 2700 ms (> 510 ms) before
> being called (GSource: 0x91139c8)
> Jan 5 01:28:11 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:28:19 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:28:23 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Jan 5 01:28:30 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:28:37 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:28:41 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started
> at 432923358 should have started at 432923088
> Jan 5 01:28:45 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:28:52 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:28:56 fox6 crmd: [3933]: info: mem_handle_event: instance=47,
> nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:29:03 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:29:07 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:29:14 fox6 crmd: [3933]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=47)
> Jan 5 01:29:22 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:29:29 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:29:33 fox6 crmd: [3933]: info: ccm_event_detail: NEW MEMBERSHIP:
> trans=47, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:29:37 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:29:44 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:29:48 fox6 crmd: [3933]: info: ccm_event_detail: CURRENT: fox4
> [nodeid=0, born=1]
> Jan 5 01:29:52 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:29:59 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:30:03 fox6 crmd: [3933]: info: ccm_event_detail: CURRENT: fox6
> [nodeid=1, born=47]
> Jan 5 01:30:06 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:30:14 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:30:21 fox6 cib: [3929]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Jan 5 01:30:25 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:30:29 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:30:36 fox6 cib: [3929]: info: mem_handle_event: no mbr_track info
> Jan 5 01:30:47 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:30:55 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:30:58 fox6 cib: [3929]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Jan 5 01:31:06 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:31:09 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:31:13 fox6 cib: [3929]: info: mem_handle_event: instance=47,
> nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
> Jan 5 01:31:17 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:31:20 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:31:28 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox4
> Jan 5 01:31:32 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:31:39 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:31:46 fox6 cib:
> [3929]: info: cib_ccm_msg_callback: PEER: fox6
> Jan 5 01:31:50 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:31:58 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:32:09 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:32:16 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:32:27 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:32:46 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:32:42 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:32:46 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:32:49 fox6 crmd: [3933]: WARN: send queue maximum length(500)
> exceeded
> Jan 5 01:32:57 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:33:12 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:33:30 fox6 ccm: [3928]: info: Break tie for 2 nodes cluster
> Jan 5 01:33:23 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
> Jan 5 01:33:30 fox6 crmd: [3933]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> [ at this point the node no longer responds, declared "stopped" in the GUI
> and OFFLINE is crm_mon, and has to be rebooted ]

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
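
To check whether the delayed-dispatch warnings cluster at a particular time of day (for example around a nightly backup), they can be bucketed by hour. What follows is only a sketch under stated assumptions: the names `delays_by_hour` and `tick_gap_ms` are made up for illustration, each syslog entry is assumed to sit on a single line (unlike the wrapped quote above), and the 10 ms per tick figure is inferred from the log itself (271 ticks between "started at" and "should have started at" against a reported 2710 ms), not taken from the heartbeat documentation.

```python
import re
from collections import Counter

# Matches heartbeat's delayed-dispatch warning as it appears in the log above,
# assuming the whole entry is on one line in the real syslog file.
DELAY_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+heartbeat:.*Dispatch function for .+ "
    r"was delayed (?P<delay>\d+) ms \(> \d+ ms\)"
)


def delays_by_hour(lines):
    """Count delayed-dispatch warnings per (month, day, hour) bucket."""
    buckets = Counter()
    for line in lines:
        m = DELAY_RE.search(line)
        if m:
            buckets[(m["mon"], int(m["day"]), int(m["hour"]))] += 1
    return buckets


def tick_gap_ms(started, expected, ms_per_tick=10):
    """Convert a 'started at X should have started at Y' pair to milliseconds.

    ms_per_tick=10 is an assumption inferred from the quoted log, where a
    271-tick gap is reported as a 2710 ms delay.
    """
    return (started - expected) * ms_per_tick
```

Feeding it the system log, e.g. `delays_by_hour(open("/var/log/messages"))` (the path is an assumption and depends on the syslog setup), would show whether the warnings pile up in the same hour every day.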
