I have been testing HA 2.x as a redundancy management solution by running a
few simple 2-node clusters with HA 2.1.4 on RHES4 update 4. By simple I mean
one resource group running an OCF tomcat, an LSB application and a floating
IP.

On most of the systems I have severe stability issues where one of the nodes
is virtually killed approximately once a day (STONITH is disabled).

Looking through the lists, I have a feeling these are issues which have
already been dealt with on other platforms and are supposed to be resolved. I
also tackled a few issues with hb_gui, which were related to the fact that
standard RHES4 distributions do not contain the proper python and python-gtk
versions to run a full-fledged hb_gui (I worked around this by using the
command line where the GUI failed).

Maybe I am wrong, but all this leads me to think that either HA 2.1.4 and
RHES4u4 do not work well together or I am getting something totally wrong,
although I do have one stable cluster with a "similar" configuration.

I'd appreciate your feedback on this.

Thanks in advance,
Rami.

P.S. 1: ha.cf for one such problematic system:

crm on
ucast eth0 10.36.22.173
auto_failback off
node    fox4
node    fox6
use_logd yes

P.S. 2: a syslog excerpt from a failure which left a node (namely fox6) dead:

Jan  5 01:21:01 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:21:01 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432920022 should have started at 432919751
Jan  5 01:21:05 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2700 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:21:05 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432920392 should have started at 432920122
Jan  5 01:21:05 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:21:05 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432920393 should have started at 432920122
Jan  5 01:21:09 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:21:09 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432920763 should have started at 432920492
Jan  5 01:21:09 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:21:13 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432920764 should have started at 432920493
Jan  5 01:21:16 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:21:20 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432921134 should have started at 432920863
Jan  5 01:21:27 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2700 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:21:31 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432921134 should have started at 432920864
Jan  5 01:21:35 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan  5 01:21:39 fox6 cib: [3929]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan  5 01:21:50 fox6 crmd: [3933]: info: mem_handle_event: no mbr_track info
Jan  5 01:21:53 fox6 cib: [3929]: info: mem_handle_event: no mbr_track info
Jan  5 01:22:04 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan  5 01:22:12 fox6 cib: [3929]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan  5 01:22:19 fox6 crmd: [3933]: info: mem_handle_event: instance=41, 
nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
Jan  5 01:22:30 fox6 cib: [3929]: info: mem_handle_event: instance=41, nodes=2, 
new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
Jan  5 01:22:41 fox6 crmd: [3933]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=41)
Jan  5 01:22:49 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox4
Jan  5 01:22:56 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox6
Jan  5 01:23:00 fox6 crmd: [3933]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=41, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
Jan  5 01:23:04 fox6 crmd: [3933]: info: ccm_event_detail:      CURRENT: fox4 
[nodeid=0, born=1]
Jan  5 01:23:11 fox6 crmd: [3933]: info: ccm_event_detail:      CURRENT: fox6 
[nodeid=1, born=41]
Jan  5 01:23:15 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:23:22 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432921505 should have started at 432921234
Jan  5 01:23:26 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:23:33 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432921505 should have started at 432921234
Jan  5 01:23:41 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2700 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:23:45 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432921875 should have started at 432921605
Jan  5 01:23:48 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:23:56 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432921876 should have started at 432921605
Jan  5 01:23:59 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:24:03 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432922246 should have started at 432921975
Jan  5 01:24:07 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan  5 01:24:10 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2700 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:24:18 fox6 crmd: [3933]: info: mem_handle_event: no mbr_track info
Jan  5 01:24:22 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432922246 should have started at 432921976
Jan  5 01:24:29 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan  5 01:24:33 fox6 crmd: [3933]: info: mem_handle_event: instance=44, 
nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
Jan  5 01:24:40 fox6 crmd: [3933]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=44)
Jan  5 01:24:48 fox6 crmd: [3933]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=44, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
Jan  5 01:24:55 fox6 crmd: [3933]: info: ccm_event_detail:      CURRENT: fox4 
[nodeid=0, born=1]
Jan  5 01:24:59 fox6 crmd: [3933]: info: ccm_event_detail:      CURRENT: fox6 
[nodeid=1, born=44]
Jan  5 01:25:06 fox6 cib: [3929]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan  5 01:25:10 fox6 cib: [3929]: info: mem_handle_event: no mbr_track info
Jan  5 01:25:13 fox6 cib: [3929]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan  5 01:25:21 fox6 cib: [3929]: info: mem_handle_event: instance=44, nodes=2, 
new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
Jan  5 01:25:25 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox4
Jan  5 01:25:32 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox6
Jan  5 01:25:39 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:25:43 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432922617 should have started at 432922346
Jan  5 01:25:47 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:25:51 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:25:54 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432922617 should have started at 432922346
Jan  5 01:25:58 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:26:02 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2700 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:26:05 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:26:13 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432922987 should have started at 432922717
Jan  5 01:26:16 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:26:24 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:26:31 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:26:39 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432922988 should have started at 432922717
Jan  5 01:26:42 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:26:50 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for send local status was delayed 2710 ms (> 510 ms) before being 
called (GSource: 0x9113798)
Jan  5 01:26:57 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:27:01 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan  5 01:27:08 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:27:27 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:27:20 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:27:27 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:27:12 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432923358 should have started at 432923087
Jan  5 01:27:31 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:27:38 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:27:45 fox6 crmd: [3933]: info: mem_handle_event: no mbr_track info
Jan  5 01:27:53 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:27:57 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:28:04 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch 
function for check for signals was delayed 2700 ms (> 510 ms) before being 
called (GSource: 0x91139c8)
Jan  5 01:28:11 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:28:19 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:28:23 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan  5 01:28:30 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:28:37 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:28:41 fox6 heartbeat: [3241]: info: Gmain_timeout_dispatch: started 
at 432923358 should have started at 432923088
Jan  5 01:28:45 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:28:52 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:28:56 fox6 crmd: [3933]: info: mem_handle_event: instance=47, 
nodes=2, new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
Jan  5 01:29:03 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:29:07 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:29:14 fox6 crmd: [3933]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=47)
Jan  5 01:29:22 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:29:29 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:29:33 fox6 crmd: [3933]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=47, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
Jan  5 01:29:37 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:29:44 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:29:48 fox6 crmd: [3933]: info: ccm_event_detail:      CURRENT: fox4 
[nodeid=0, born=1]
Jan  5 01:29:52 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:29:59 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:30:03 fox6 crmd: [3933]: info: ccm_event_detail:      CURRENT: fox6 
[nodeid=1, born=47]
Jan  5 01:30:06 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:30:14 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:30:21 fox6 cib: [3929]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan  5 01:30:25 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:30:29 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:30:36 fox6 cib: [3929]: info: mem_handle_event: no mbr_track info
Jan  5 01:30:47 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:30:55 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:30:58 fox6 cib: [3929]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan  5 01:31:06 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:31:09 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:31:13 fox6 cib: [3929]: info: mem_handle_event: instance=47, nodes=2, 
new=0, lost=0, n_idx=0, new_idx=2, old_idx=4
Jan  5 01:31:17 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:31:20 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:31:28 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox4
Jan  5 01:31:32 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:31:39 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:31:46 fox6 cib: [3929]: info: cib_ccm_msg_callback: PEER: fox6
Jan  5 01:31:50 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:31:58 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:32:09 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:32:16 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:32:27 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:32:46 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:32:42 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:32:46 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:32:49 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:32:57 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:33:12 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:33:30 fox6 ccm: [3928]: info: Break tie for 2 nodes cluster
Jan  5 01:33:23 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
Jan  5 01:33:30 fox6 crmd: [3933]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
[ at this point the node no longer responds, is declared "stopped" in the GUI
and OFFLINE in crm_mon, and has to be rebooted ]
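
In case it helps anyone reading a log like the one above, here is a minimal
Python sketch for tallying the two recurring warnings (late dispatches in
heartbeat, and "send queue maximum length" in crmd/ccm). It is not part of
heartbeat. The names unwrap and summarize and the regular expressions are
assumptions of this sketch, derived only from the message formats shown in the
excerpt; it also assumes entries may be wrapped across lines, as they are in
this mail.

```python
import re
from collections import Counter, defaultdict

# A new syslog entry starts with a timestamp such as "Jan  5 01:21:01".
ENTRY_START = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} ")
# heartbeat warning about a main-loop dispatch that ran late.
DELAY = re.compile(r"Dispatch function for (.+?) was delayed (\d+) ms \(> (\d+) ms\)")
# Queue warning as logged by crmd and ccm in the excerpt.
QUEUE = re.compile(r" (\w+): \[\d+\]: WARN: send queue maximum length\((\d+)\) exceeded")


def unwrap(text):
    """Rejoin entries that were wrapped across several lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            # Continuation of the previous entry.
            entries[-1] += " " + line
    return entries


def summarize(text):
    """Return (delays in ms per dispatch function, overflow count per daemon)."""
    delays = defaultdict(list)
    overflows = Counter()
    for entry in unwrap(text):
        m = DELAY.search(entry)
        if m:
            delays[m.group(1)].append(int(m.group(2)))
            continue
        m = QUEUE.search(entry)
        if m:
            overflows[m.group(1)] += 1
    return dict(delays), overflows


# Three entries copied from the excerpt above, including one wrapped entry.
SAMPLE = """\
Jan  5 01:21:01 fox6 heartbeat: [3241]: WARN: Gmain_timeout_dispatch: Dispatch
function for check for signals was delayed 2710 ms (> 510 ms) before being
called (GSource: 0x91139c8)
Jan  5 01:25:51 fox6 crmd: [3933]: WARN: send queue maximum length(500) exceeded
Jan  5 01:27:27 fox6 ccm: [3928]: WARN: send queue maximum length(64) exceeded
"""

if __name__ == "__main__":
    delays, overflows = summarize(SAMPLE)
    for name, values in delays.items():
        print(f"{name}: {len(values)} late dispatch(es), max {max(values)} ms")
    for daemon, count in overflows.items():
        print(f"{daemon}: {count} send-queue warning(s)")
```

To run it against a real file, replace SAMPLE with the contents of the syslog
file in question. The counts only describe what was logged; they do not by
themselves say why the node stopped responding.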


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
