Hi,
This looks like a communication problem caused by iptables. After
rebooting your primary server you might have forgotten to turn
iptables off.
We have observed this "ERROR: Message hist queue is filling up (200
messages in queue)" when iptables is enabled.
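If you would rather keep the firewall up than disable it, a minimal sketch (assuming the default filter table, and taking the UDP port 694 and interface eth3 from your ha.cf below) would be to allow the heartbeat traffic explicitly:

```shell
# Sketch: let heartbeat's UDP broadcast traffic (port 694, per ha.cf)
# through iptables instead of flushing all rules. Interface eth3 is
# the bcast interface from the poster's ha.cf; adjust to your setup.
iptables -A INPUT  -i eth3 -p udp --dport 694 -j ACCEPT
iptables -A OUTPUT -o eth3 -p udp --dport 694 -j ACCEPT
```

Run on both nodes, then watch the logs to see whether the rexmit warnings stop.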
Regards,
Arun.
Message: 5
Date: Thu, 6 Mar 2008 16:24:58 -0200
From: "Roberto Scattini" <[EMAIL PROTECTED]>
Subject: [Linux-HA] help with a broken cluster!
To: [email protected]
Message-ID:
<[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1
list:
I have a problem. I had a simple two-node (lb2 and lb3) heartbeat v2
config working fine. I have two boxes with 4 interfaces each: one
crossover cable between the boxes and a null-modem serial cable on
/dev/ttyS0. On the other interfaces of each box I have a real and a
virtual IP; heartbeat manages those virtual IPs.
It was working fine, but one day one of my co-workers messed up ( :S )
with a script that deleted /etc on the primary server.
Heartbeat didn't switch to the other machine (I think because of that
particular problem), so the primary server was shut down the hard
way...
Then heartbeat switched the virtual IPs, just as expected...
Later my co-worker restored /etc from a backup made by the same script
that had deleted it ( :D ), and the virtual IPs went back to that
server since it was the preferred node...
This Monday I came back from my vacation (yes, all this happened
during my vacation...) and yesterday I found that heartbeat on the
primary server was at 100% CPU usage and logging all the time:
Mar 6 11:58:48 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request took
too long to execute: 870 ms (> 10 ms) (GSource: 0x1dd38138)
Mar 6 11:58:48 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for send local status was
delayed 3130 ms (> 510 ms) before being called (GSource: 0x80fd0a8)
Mar 6 11:58:48 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839940551 should have started at
1839940238
Mar 6 11:58:48 localhost heartbeat: [2683]: ERROR: Message hist queue
is filling up (200 messages in queue)
Mar 6 11:58:48 localhost heartbeat: [2683]: debug: hist->ackseq =519517
Mar 6 11:58:48 localhost heartbeat: [2683]: debug: hist->lowseq
=519389, hist->hiseq=519589
Mar 6 11:58:48 localhost heartbeat: [2683]: debug: expecting from lb3
Mar 6 11:58:48 localhost heartbeat: [2683]: debug: it's ackseq=519517
Mar 6 11:58:48 localhost heartbeat: [2683]: debug:
Mar 6 11:58:48 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request was
delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dcfd4b8)
Mar 6 11:58:48 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839940617 should have started at
1839940229
Mar 6 11:58:49 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request took
too long to execute: 860 ms (> 10 ms) (GSource: 0x1dcfd4b8)
Mar 6 11:58:49 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request was
delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dd38268)
Mar 6 11:58:49 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839940703 should have started at
1839940315
Mar 6 11:58:50 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request took
too long to execute: 870 ms (> 10 ms) (GSource: 0x1dd38268)
Mar 6 11:58:50 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request was
delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dd38300)
Mar 6 11:58:50 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839940790 should have started at
1839940402
Mar 6 11:58:51 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request took
too long to execute: 870 ms (> 10 ms) (GSource: 0x1dd38300)
Mar 6 11:58:51 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request was
delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dd38398)
Mar 6 11:58:51 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839940877 should have started at
1839940489
Mar 6 11:58:52 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request took
too long to execute: 860 ms (> 10 ms) (GSource: 0x1dd38398)
Mar 6 11:58:52 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for send local status was
delayed 3120 ms (> 510 ms) before being called (GSource: 0x80fd0a8)
Mar 6 11:58:52 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839940963 should have started at
1839940651
Mar 6 11:58:52 localhost heartbeat: [2683]: ERROR: Message hist queue
is filling up (200 messages in queue)
Mar 6 11:58:52 localhost heartbeat: [2683]: debug: hist->ackseq =519517
Mar 6 11:58:52 localhost heartbeat: [2683]: debug: hist->lowseq
=519390, hist->hiseq=519590
Mar 6 11:58:52 localhost heartbeat: [2683]: debug: expecting from lb3
Mar 6 11:58:52 localhost heartbeat: [2683]: debug: it's ackseq=519517
Mar 6 11:58:52 localhost heartbeat: [2683]: debug:
Mar 6 11:58:52 localhost heartbeat: [2683]: WARN:
Gmain_timeout_dispatch: Dispatch function for retransmit request was
delayed 3650 ms (> 500 ms) before being called (GSource: 0x1dd1adf0)
Mar 6 11:58:52 localhost heartbeat: [2683]: info:
Gmain_timeout_dispatch: started at 1839941007 should have started at
1839940642
Yesterday I also discovered that the network connection between the
two nodes over the crossover cable was unplugged (it is now working fine).
Today I restarted and deleted all the resources from the slave server
(lb3)... but heartbeat doesn't connect to the primary node... it says
this:
Mar 6 14:06:20 localhost logd: [1104]: info: logd started with
default configuration.
Mar 6 14:06:20 localhost logd: [1104]: WARN: Core dumps could be lost
if multiple dumps occur.
Mar 6 14:06:20 localhost logd: [1104]: WARN: Consider setting
non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
maximum
supportability
Mar 6 14:06:20 localhost logd: [1104]: WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
Mar 6 14:06:20 localhost logd: [1105]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Mar 6 14:06:20 localhost logd: [1104]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Mar 6 14:06:20 localhost heartbeat: [1125]: info: No log entry found
in ha.cf -- use logd
Mar 6 14:06:20 localhost heartbeat: [1125]: info: Enabling logging
daemon
Mar 6 14:06:20 localhost heartbeat: [1125]: info: logfile and debug
file are those specified in logd config file (default /etc/logd.cf)
Mar 6 14:06:20 localhost heartbeat: [1125]: info:
**************************
Mar 6 14:06:20 localhost heartbeat: [1125]: info: Configuration
validated. Starting heartbeat 2.1.2
Mar 6 14:06:20 localhost heartbeat: [1126]: info: heartbeat: version
2.1.2
Mar 6 14:06:20 localhost heartbeat: [1126]: info: Heartbeat
generation: 1201727554
Mar 6 14:06:20 localhost heartbeat: [1126]: info:
G_main_add_TriggerHandler: Added signal manual handler
Mar 6 14:06:20 localhost heartbeat: [1126]: info:
G_main_add_TriggerHandler: Added signal manual handler
Mar 6 14:06:20 localhost heartbeat: [1126]: info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
Mar 6 14:06:20 localhost heartbeat: [1126]: info: glib: UDP Broadcast
heartbeat started on port 694 (694) interface eth3
Mar 6 14:06:20 localhost heartbeat: [1126]: info: glib: UDP Broadcast
heartbeat closed on port 694 interface eth3 - Status: 1
Mar 6 14:06:20 localhost heartbeat: [1126]: info: glib: Starting
serial heartbeat on tty /dev/ttyS0 (19200 baud)
Mar 6 14:06:20 localhost heartbeat: [1126]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Mar 6 14:06:20 localhost heartbeat: [1126]: info: Local status now set
to: 'up'
Mar 6 14:06:21 localhost heartbeat: [1126]: info: Link lb2:eth3 up.
Mar 6 14:06:21 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 4 is max.
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: ha_msg_addraw_ll:
illegal field
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: ha_msg_addraw():
ha_msg_addraw_ll failed
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: NV failure
(string2msg_ll):
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: Input string: [>>>
t=NS_rexmit >>> t=NS_rexmit dest=lb3 firstseq=668078 lastseq=668078
(1)destuuid=CDtAuMSNSVyK32eKbegm1w== src=lb2
(1)srcuuid=fj6r52ggTvKO8EGufK5c1g== hg=479df638 ts=47cfec7c ttl=3
auth=1 c47fe2984873891a3d839df5e25fe9c9fbb4eafa <<< ]
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: sp=>>> t=NS_rexmit
dest=lb3 firstseq=668078 lastseq=668078
(1)destuuid=CDtAuMSNSVyK32eKbegm1w== src=lb2
(1)srcuuid=fj6r52ggTvKO8EGufK5c1g== hg=479df638
ts=47cfec7c ttl=3 auth=1 c47fe2984873891a3d839df5e25fe9c9fbb4eafa <<<
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: depth=0
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: MSG: Dumping
message with 1 fields
Mar 6 14:06:21 localhost heartbeat: [1126]: ERROR: MSG[0] :
[t=NS_rexmit]
Mar 6 14:06:21 localhost heartbeat: [1126]: info: Link lb2:/dev/ttyS0
up.
Mar 6 14:06:21 localhost heartbeat: [1126]: info: Status update for
node lb2: status active
Mar 6 14:06:21 localhost heartbeat: [1126]: info: Link lb3:eth3 up.
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Comm_now_up():
updating status to active
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Local status now
set to: 'active'
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/ccm" (999,999)
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/cib" (999,999)
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/lrmd -r" (0,0)
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/stonithd" (0,0)
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/attrd" (999,999)
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/crmd" (999,999)
Mar 6 14:06:22 localhost heartbeat: [1126]: info: Starting child
client "/usr/lib/heartbeat/mgmtd -v" (0,0)
Mar 6 14:06:22 localhost heartbeat: [1136]: info: Starting
"/usr/lib/heartbeat/ccm" as uid 999 gid 999 (pid 1136)
Mar 6 14:06:22 localhost heartbeat: [1137]: info: Starting
"/usr/lib/heartbeat/cib" as uid 999 gid 999 (pid 1137)
Mar 6 14:06:22 localhost cib: [1137]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Mar 6 14:06:22 localhost cib: [1137]: info:
G_main_add_TriggerHandler: Added signal manual handler
Mar 6 14:06:22 localhost cib: [1137]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Mar 6 14:06:22 localhost cib: [1137]: info: main: Retrieval of a
per-action CIB: disabled
Mar 6 14:06:22 localhost cib: [1137]: info: readCibXmlFile: Reading
cluster configuration from: /var/lib/heartbeat/crm/cib.xml
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <cib generated="false" admin_epoch="0"
have_quorum="false" ignore_dtd="false" num_peers="2"
cib_feature_revision="1.3" epoch="24" num_updates="2"
cib-last-written="Thu Mar 6 14:03:59 2008" ccm_transition="1">
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <configuration>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <crm_config/>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <nodes>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <node
id="083b40b8-c48d-495c-8adf-678a6de826d7"
uname="lb3" type="normal"/>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <node
id="7e3eabe7-6820-4ef2-8ef0-41ae7cae5cd6"
uname="lb2" type="normal"/>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] </nodes>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <resources/>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <constraints/>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] </configuration>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] <status/>
Mar 6 14:06:22 localhost cib: [1137]: info: log_data_element:
readCibXmlFile: [on-disk] </cib>
Mar 6 14:06:22 localhost cib: [1137]: info: startCib: CIB
Initialization completed successfully
Mar 6 14:06:22 localhost cib: [1137]: info: cib_register_ha: Signing
in with Heartbeat
Mar 6 14:06:22 localhost cib: [1137]: info: cib_register_ha: FSA
Hostname: lb3
Mar 6 14:06:22 localhost cib: [1137]: WARN: cib_init: CCM Activation
failed
Mar 6 14:06:22 localhost cib: [1137]: WARN: cib_init: CCM Connection
failed 1 times (30 max)
Mar 6 14:06:22 localhost heartbeat: [1138]: info: Starting
"/usr/lib/heartbeat/lrmd -r" as uid 0 gid 0 (pid 1138)
Mar 6 14:06:22 localhost lrmd: [1138]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Mar 6 14:06:22 localhost heartbeat: [1139]: info: Starting
"/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 1139)
Mar 6 14:06:22 localhost heartbeat: [1141]: info: Starting
"/usr/lib/heartbeat/crmd" as uid 999 gid 999 (pid 1141)
Mar 6 14:06:22 localhost crmd: [1141]: info: main: CRM Hg Version:
feb1bb614331 tip
Mar 6 14:06:22 localhost crmd: [1141]: WARN: Core dumps could be lost
if multiple dumps occur.
Mar 6 14:06:22 localhost crmd: [1141]: WARN: Consider setting
non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
maximum
supportability
Mar 6 14:06:22 localhost crmd: [1141]: WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
Mar 6 14:06:22 localhost crmd: [1141]: info: crmd_init: Starting crmd
Mar 6 14:06:22 localhost crmd: [1141]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Mar 6 14:06:22 localhost crmd: [1141]: info:
G_main_add_TriggerHandler: Added signal manual handler
Mar 6 14:06:22 localhost crmd: [1141]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Mar 6 14:06:22 localhost heartbeat: [1142]: info: Starting
"/usr/lib/heartbeat/mgmtd -v" as uid 0 gid 0 (pid 1142)
Mar 6 14:06:22 localhost mgmtd: [1142]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Mar 6 14:06:22 localhost mgmtd: [1142]: debug: Enabling coredumps
Mar 6 14:06:22 localhost mgmtd: [1142]: WARN: Core dumps could be
lost if multiple dumps occur.
Mar 6 14:06:22 localhost mgmtd: [1142]: WARN: Consider setting
non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
maximum
supportability
Mar 6 14:06:22 localhost mgmtd: [1142]: WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
Mar 6 14:06:22 localhost mgmtd: [1142]: info:
G_main_add_SignalHandler: Added signal handler for signal 10
Mar 6 14:06:22 localhost mgmtd: [1142]: info:
G_main_add_SignalHandler: Added signal handler for signal 12
Mar 6 14:06:22 localhost mgmtd: [1142]: WARN: lrm_signon: can not
initiate connection
Mar 6 14:06:22 localhost mgmtd: [1142]: info: login to lrm: 0, ret:0
Mar 6 14:06:22 localhost heartbeat: [1140]: info: Starting
"/usr/lib/heartbeat/attrd" as uid 999 gid 999 (pid 1140)
Mar 6 14:06:22 localhost attrd: [1140]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Mar 6 14:06:22 localhost attrd: [1140]: info: register_with_ha:
Hostname: lb3
Mar 6 14:06:22 localhost ccm: [1136]: info: Hostname: lb3
Mar 6 14:06:22 localhost lrmd: [1138]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Mar 6 14:06:22 localhost lrmd: [1138]: WARN: Core dumps could be lost
if multiple dumps occur.
Mar 6 14:06:22 localhost lrmd: [1138]: WARN: Consider setting
non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
maximum
supportability
Mar 6 14:06:22 localhost lrmd: [1138]: WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
Mar 6 14:06:22 localhost lrmd: [1138]: info:
G_main_add_SignalHandler: Added signal handler for signal 10
Mar 6 14:06:22 localhost lrmd: [1138]: info:
G_main_add_SignalHandler: Added signal handler for signal 12
Mar 6 14:06:22 localhost lrmd: [1138]: info: Started.
Mar 6 14:06:22 localhost stonithd: [1139]: WARN: Core dumps could be
lost if multiple dumps occur.
Mar 6 14:06:22 localhost stonithd: [1139]: WARN: Consider setting
non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
maximum supportability
Mar 6 14:06:22 localhost stonithd: [1139]: WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
supportability
Mar 6 14:06:22 localhost stonithd: [1139]: info:
G_main_add_SignalHandler: Added signal handler for signal 10
Mar 6 14:06:22 localhost stonithd: [1139]: info:
G_main_add_SignalHandler: Added signal handler for signal 12
Mar 6 14:06:22 localhost stonithd: [1139]: info: Signing in with
heartbeat.
Mar 6 14:06:22 localhost attrd: [1140]: info: register_with_ha: UUID:
083b40b8-c48d-495c-8adf-678a6de826d7
Mar 6 14:06:22 localhost stonithd: [1139]: notice:
/usr/lib/heartbeat/stonithd start up successfully.
Mar 6 14:06:22 localhost stonithd: [1139]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Mar 6 14:06:23 localhost cib: [1137]: WARN: cib_init: CCM Activation
failed
Mar 6 14:06:23 localhost cib: [1137]: WARN: cib_init: CCM Connection
failed 2 times (30 max)
Mar 6 14:06:23 localhost mgmtd: [1142]: info: init_crm
Mar 6 14:06:23 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 13 is max.
Mar 6 14:06:24 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 13 is max.
Mar 6 14:06:24 localhost cib: [1137]: WARN: cib_init: CCM Activation
failed
Mar 6 14:06:24 localhost cib: [1137]: WARN: cib_init: CCM Connection
failed 3 times (30 max)
Mar 6 14:06:24 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 14 is max.
Mar 6 14:06:24 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 14 is max.
Mar 6 14:06:25 localhost cib: [1137]: WARN: cib_init: CCM Activation
failed
Mar 6 14:06:25 localhost cib: [1137]: WARN: cib_init: CCM Connection
failed 4 times (30 max)
Mar 6 14:06:25 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 15 is max.
Mar 6 14:06:25 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 15 is max.
Mar 6 14:06:25 localhost ccm: [1136]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Mar 6 14:06:26 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 17 is max.
Mar 6 14:06:26 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 17 is max.
Mar 6 14:06:26 localhost cib: [1137]: info: cib_init: Starting cib
mainloop
Mar 6 14:06:26 localhost cib: [1137]: info: cib_null_callback:
Setting cib_refresh_notify callbacks for crmd: on
Mar 6 14:06:26 localhost crmd: [1141]: info: do_cib_control: CIB
connection established
Mar 6 14:06:26 localhost cib: [1137]: info:
cib_client_status_callback: Status update: Client lb3/cib now has
status [join]
Mar 6 14:06:26 localhost cib: [1137]: info:
cib_client_status_callback: Status update: Client lb3/cib now has
status [online]
Mar 6 14:06:26 localhost cib: [1143]: info: write_cib_contents: Wrote
version 0.24.2 of the CIB to disk (digest:
200d9f4a3ca4b67ed829ca95b4561358)
Mar 6 14:06:26 localhost crmd: [1141]: info: register_with_ha:
Hostname: lb3
Mar 6 14:06:26 localhost cib: [1137]: info: cib_null_callback:
Setting cib_diff_notify callbacks for mgmtd: on
Mar 6 14:06:27 localhost crmd: [1141]: info: register_with_ha: UUID:
083b40b8-c48d-495c-8adf-678a6de826d7
Mar 6 14:06:27 localhost mgmtd: [1142]: debug: main: run the loop...
Mar 6 14:06:27 localhost mgmtd: [1142]: info: Started.
Mar 6 14:06:27 localhost crmd: [1141]: info: populate_cib_nodes:
Requesting the list of configured nodes
Mar 6 14:06:28 localhost cib: [1137]: info:
cib_client_status_callback: Status update: Client lb2/cib now has
status [online]
Mar 6 14:06:28 localhost crmd: [1141]: notice: populate_cib_nodes:
Node: lb3 (uuid: 083b40b8-c48d-495c-8adf-678a6de826d7)
Mar 6 14:06:29 localhost crmd: [1141]: notice: populate_cib_nodes:
Node: lb2 (uuid: 7e3eabe7-6820-4ef2-8ef0-41ae7cae5cd6)
Mar 6 14:06:29 localhost crmd: [1141]: info: do_ha_control: Connected
to Heartbeat
Mar 6 14:06:29 localhost crmd: [1141]: info: do_ccm_control: CCM
connection established... waiting for first callback
Mar 6 14:06:29 localhost crmd: [1141]: info: do_started: Delaying
start, CCM (0000000000100000) not connected
Mar 6 14:06:29 localhost crmd: [1141]: info: crmd_init: Starting
crmd's mainloop
Mar 6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
default value '10s' for cluster option 'dc_deadtime'
Mar 6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
default value '0' for cluster option 'cluster_recheck_interval'
Mar 6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
default value '2min' for cluster option 'election_timeout'
Mar 6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
default value '20min' for cluster option 'shutdown_escalation'
Mar 6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
default value '3min' for cluster option 'crmd-integration-timeout'
Mar 6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
default value '10min' for cluster option 'crmd-finalization-timeout'
Mar 6 14:06:29 localhost crmd: [1141]: notice:
crmd_client_status_callback: Status update: Client lb3/crmd now has
status [online]
Mar 6 14:06:29 localhost crmd: [1141]: notice:
crmd_client_status_callback: Status update: Client lb3/crmd now has
status [online]
Mar 6 14:06:29 localhost crmd: [1141]: info: do_started: Delaying
start, CCM (0000000000100000) not connected
Mar 6 14:06:30 localhost crmd: [1141]: notice:
crmd_client_status_callback: Status update: Client lb2/crmd now has
status [online]
Mar 6 14:06:30 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 27 is max.
Mar 6 14:06:30 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 27 is max.
Mar 6 14:06:30 localhost crmd: [1141]: info: do_started: Delaying
start, CCM (0000000000100000) not connected
Mar 6 14:06:31 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 28 is max.
Mar 6 14:06:31 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 28 is max.
Mar 6 14:06:31 localhost attrd: [1140]: info: main: Starting
mainloop...
Mar 6 14:06:32 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 28 is max.
Mar 6 14:06:32 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 29 is max.
Mar 6 14:06:33 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 29 is max.
Mar 6 14:06:33 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 29 is max.
Mar 6 14:06:34 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 31 is max.
Mar 6 14:06:34 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 31 is max.
Mar 6 14:06:35 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 31 is max.
Mar 6 14:06:35 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 32 is max.
Mar 6 14:06:36 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 32 is max.
Mar 6 14:06:36 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 32 is max.
Mar 6 14:06:37 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 33 is max.
Mar 6 14:06:37 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 33 is max.
Mar 6 14:06:38 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 35 is max.
Mar 6 14:06:38 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 36 is max.
Mar 6 14:06:39 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 36 is max.
Mar 6 14:06:39 localhost heartbeat: [1126]: WARN: Rexmit of seq
668083 requested. 36 is max.
Mar 6 14:06:40 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 37 is max.
Mar 6 14:06:40 localhost heartbeat: [1126]: WARN: Rexmit of seq
668072 requested. 37 is max.
Mar 6 14:06:41 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 38 is max.
Mar 6 14:06:41 localhost heartbeat: [1126]: WARN: Rexmit of seq
668078 requested. 38 is max.
Mar 6 14:06:42 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 40 is max.
Mar 6 14:06:42 localhost heartbeat: [1126]: WARN: Rexmit of seq
668082 requested. 40 is max.
and so on, until I shut it down...
One note: I deleted all the resources from the slave server (cibadmin
-E), since when I started it yesterday (strangely, heartbeat wasn't
running), it was disconnected from the other node and it added the
same virtual IPs that the other node was running.
I don't have logs older than 5 days or so... and I didn't find
anything useful there...
I'm running this on Debian etch with packages I compiled myself
(without special options...); the heartbeat package is
heartbeat-2.1.2, and these are my config files:
ha.cf :
bcast eth3
serial /dev/ttyS0
udpport 694
deadtime 10
node lb2 lb3
# commented out until we know how to use it
# use_logd yes
crm yes
-----
lb.xml:
<cib>
  <configuration>
    <crm_config/>
    <nodes/>
    <resources>
      <group id="lb_group-web">
        <primitive id="rsc_ip-ipexterna" class="ocf" type="IPaddr2" provider="heartbeat">
          <instance_attributes>
            <attributes>
              <nvpair id="127" name="ip" value="10.60.5.11"/>
              <nvpair id="128" name="netmask" value="22"/>
              <nvpair id="129" name="nic" value="eth1"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="rsc_ip-web" class="ocf" type="IPaddr2" provider="heartbeat">
          <instance_attributes>
            <attributes>
              <nvpair id="109" name="ip" value="192.168.201.1"/>
              <nvpair id="110" name="netmask" value="24"/>
              <nvpair id="111" name="nic" value="eth2"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="rsc_ip-db" class="ocf" type="IPaddr2" provider="heartbeat">
          <instance_attributes>
            <attributes>
              <nvpair id="123" name="ip" value="192.168.206.1"/>
              <nvpair id="124" name="netmask" value="24"/>
              <nvpair id="125" name="nic" value="eth0"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="rsc_ldirector" class="lsb" type="ldirectord"/>
      </group>
    </resources>
    <constraints>
      <rsc_location id="run_group-web" rsc="lb_group-web">
        <rule id="pref_run_lg_group-web" score="100">
          <expression attribute="#uname" operation="eq" value="lb2"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
  <status/>
</cib>
and we load this config with:
#!/bin/sh
CONF=/etc/ha.d/lb.xml
# erase the current CIB, then create a fresh one from the XML file
cibadmin -E
cibadmin -C -x $CONF
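After loading, one rough sanity check (a sketch, not from the original post: cibadmin -Q queries the live CIB) is to dump the running CIB on each node and compare checksums, to confirm both nodes converged on the same configuration:

```shell
# Sketch: dump the live CIB on this node and checksum it;
# run on lb2 and lb3 and compare the two md5 sums by hand.
cibadmin -Q > /tmp/cib-live.xml
md5sum /tmp/cib-live.xml
```

If the sums differ, the nodes never synchronized and the rexmit storm in the logs is the likely reason.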
and the authkeys file is the same on both nodes..
-rw------- 1 root root 623 2008-01-30 19:11 authkeys
I really have no idea where to start looking for the cause of all of
these problems... (besides my co-worker... :D )
Any advice is welcome...
thanks!
--
Roberto Scattini
___ _
))_) __ )L __
((__)(('(( ((_)
------------------------------
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
End of Linux-HA Digest, Vol 52, Issue 18
****************************************