Hi,

On Thu, Mar 06, 2008 at 04:24:58PM -0200, Roberto Scattini wrote:
> list:
> 
> I have a problem. I had a simple two-node (lb2 and lb3) heartbeat v2
> config working fine: two boxes with 4 interfaces each, one crossover
> cable between the boxes and a null-modem serial cable on /dev/ttyS0.
> On the other interfaces of each box I have a real and a virtual IP,
> and heartbeat manages those virtual IPs.
> It was working fine, but one day one of my co-workers slipped up ( :S )
> with a script that deleted /etc on the primary server.
> 
> Heartbeat didn't fail over to the other machine (I think because of
> that particular problem), so the primary server was shut down the
> hard way...
> 
> Then heartbeat switched the virtual IPs over, just as expected...
> 
> Later my co-worker restored /etc from a backup made by the same script
> that had deleted it ( :D ), and the virtual IPs went back to that
> server, since it was the preferred node...
> This Monday I came back from my vacation (yes, all of this happened
> during my vacation...) and yesterday I found heartbeat on the primary
> server stuck at 100% CPU usage and logging constantly:
> 
> 
> Mar  6 11:58:48 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request took
> too long to execute: 870 ms (> 10 ms) (GSource: 0x1dd38138)
> Mar  6 11:58:48 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for send local status was
> delayed 3130 ms (> 510 ms) before being called (GSource: 0x80fd0a8)
> Mar  6 11:58:48 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839940551 should have started at
> 1839940238
> Mar  6 11:58:48 localhost heartbeat: [2683]: ERROR: Message hist queue
> is filling up (200 messages in queue)
> Mar  6 11:58:48 localhost heartbeat: [2683]: debug: hist->ackseq =519517
> Mar  6 11:58:48 localhost heartbeat: [2683]: debug: hist->lowseq
> =519389, hist->hiseq=519589
> Mar  6 11:58:48 localhost heartbeat: [2683]: debug: expecting from lb3
> Mar  6 11:58:48 localhost heartbeat: [2683]: debug: it's ackseq=519517
> Mar  6 11:58:48 localhost heartbeat: [2683]: debug:
> Mar  6 11:58:48 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request was
> delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dcfd4b8)
> Mar  6 11:58:48 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839940617 should have started at
> 1839940229
> Mar  6 11:58:49 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request took
> too long to execute: 860 ms (> 10 ms) (GSource: 0x1dcfd4b8)
> Mar  6 11:58:49 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request was
> delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dd38268)
> Mar  6 11:58:49 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839940703 should have started at
> 1839940315
> Mar  6 11:58:50 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request took
> too long to execute: 870 ms (> 10 ms) (GSource: 0x1dd38268)
> Mar  6 11:58:50 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request was
> delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dd38300)
> Mar  6 11:58:50 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839940790 should have started at
> 1839940402
> Mar  6 11:58:51 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request took
> too long to execute: 870 ms (> 10 ms) (GSource: 0x1dd38300)
> Mar  6 11:58:51 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request was
> delayed 3880 ms (> 500 ms) before being called (GSource: 0x1dd38398)
> Mar  6 11:58:51 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839940877 should have started at
> 1839940489
> Mar  6 11:58:52 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request took
> too long to execute: 860 ms (> 10 ms) (GSource: 0x1dd38398)
> Mar  6 11:58:52 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for send local status was
> delayed 3120 ms (> 510 ms) before being called (GSource: 0x80fd0a8)
> Mar  6 11:58:52 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839940963 should have started at
> 1839940651
> Mar  6 11:58:52 localhost heartbeat: [2683]: ERROR: Message hist queue
> is filling up (200 messages in queue)
> Mar  6 11:58:52 localhost heartbeat: [2683]: debug: hist->ackseq =519517
> Mar  6 11:58:52 localhost heartbeat: [2683]: debug: hist->lowseq
> =519390, hist->hiseq=519590
> Mar  6 11:58:52 localhost heartbeat: [2683]: debug: expecting from lb3
> Mar  6 11:58:52 localhost heartbeat: [2683]: debug: it's ackseq=519517
> Mar  6 11:58:52 localhost heartbeat: [2683]: debug:
> Mar  6 11:58:52 localhost heartbeat: [2683]: WARN:
> Gmain_timeout_dispatch: Dispatch function for retransmit request was
> delayed 3650 ms (> 500 ms) before being called (GSource: 0x1dd1adf0)
> Mar  6 11:58:52 localhost heartbeat: [2683]: info:
> Gmain_timeout_dispatch: started at 1839941007 should have started at
> 1839940642
> 
> Yesterday I also discovered that the crossover-cable connection
> between the two nodes was unplugged (it is working fine now).
> Today I restarted and deleted all the resources from the slave server
> (lb3)... but heartbeat doesn't connect to the primary node... it says
> this:
> 
> 
> Mar  6 14:06:20 localhost logd: [1104]: info: logd started with
> default configuration.
> Mar  6 14:06:20 localhost logd: [1104]: WARN: Core dumps could be lost
> if multiple dumps occur.
> Mar  6 14:06:20 localhost logd: [1104]: WARN: Consider setting
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
> maximum
> supportability
> Mar  6 14:06:20 localhost logd: [1104]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Mar  6 14:06:20 localhost logd: [1105]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Mar  6 14:06:20 localhost logd: [1104]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Mar  6 14:06:20 localhost heartbeat: [1125]: info: No log entry found
> in ha.cf -- use logd
> Mar  6 14:06:20 localhost heartbeat: [1125]: info: Enabling logging daemon
> Mar  6 14:06:20 localhost heartbeat: [1125]: info: logfile and debug
> file are those specified in logd config file (default /etc/logd.cf)
> Mar  6 14:06:20 localhost heartbeat: [1125]: info: **************************
> Mar  6 14:06:20 localhost heartbeat: [1125]: info: Configuration
> validated. Starting heartbeat 2.1.2
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: heartbeat: version 2.1.2
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: Heartbeat
> generation: 1201727554
> Mar  6 14:06:20 localhost heartbeat: [1126]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Mar  6 14:06:20 localhost heartbeat: [1126]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: glib: UDP Broadcast
> heartbeat started on port 694 (694) interface eth3
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: glib: UDP Broadcast
> heartbeat closed on port 694 interface eth3 - Status: 1
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: glib: Starting
> serial heartbeat on tty /dev/ttyS0 (19200 baud)
> Mar  6 14:06:20 localhost heartbeat: [1126]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Mar  6 14:06:20 localhost heartbeat: [1126]: info: Local status now set to: 
> 'up'
> Mar  6 14:06:21 localhost heartbeat: [1126]: info: Link lb2:eth3 up.
> Mar  6 14:06:21 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 4 is max.
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: ha_msg_addraw_ll:
> illegal field
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: ha_msg_addraw():
> ha_msg_addraw_ll failed
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: NV failure 
> (string2msg_ll):
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: Input string: [>>>
> t=NS_rexmit >>> t=NS_rexmit dest=lb3 firstseq=668078 lastseq=668078
> (1)destuuid=CDtAuMSNSVyK32eKbegm1w== src=lb2
> (1)srcuuid=fj6r52ggTvKO8EGufK5c1g== hg=479df638 ts=47cfec7c ttl=3
> auth=1 c47fe2984873891a3d839df5e25fe9c9fbb4eafa <<< ]
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: sp=>>> t=NS_rexmit
> dest=lb3 firstseq=668078 lastseq=668078
> (1)destuuid=CDtAuMSNSVyK32eKbegm1w== src=lb2
> (1)srcuuid=fj6r52ggTvKO8EGufK5c1g== hg=479df638
> ts=47cfec7c ttl=3 auth=1 c47fe2984873891a3d839df5e25fe9c9fbb4eafa <<<
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: depth=0
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: MSG: Dumping
> message with 1 fields
> Mar  6 14:06:21 localhost heartbeat: [1126]: ERROR: MSG[0] : [t=NS_rexmit]
> Mar  6 14:06:21 localhost heartbeat: [1126]: info: Link lb2:/dev/ttyS0 up.
> Mar  6 14:06:21 localhost heartbeat: [1126]: info: Status update for
> node lb2: status active
> Mar  6 14:06:21 localhost heartbeat: [1126]: info: Link lb3:eth3 up.
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Comm_now_up():
> updating status to active
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Local status now
> set to: 'active'
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/ccm" (999,999)
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/cib" (999,999)
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/lrmd -r" (0,0)
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/stonithd" (0,0)
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/attrd" (999,999)
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/crmd" (999,999)
> Mar  6 14:06:22 localhost heartbeat: [1126]: info: Starting child
> client "/usr/lib/heartbeat/mgmtd -v" (0,0)
> Mar  6 14:06:22 localhost heartbeat: [1136]: info: Starting
> "/usr/lib/heartbeat/ccm" as uid 999  gid 999 (pid 1136)
> Mar  6 14:06:22 localhost heartbeat: [1137]: info: Starting
> "/usr/lib/heartbeat/cib" as uid 999  gid 999 (pid 1137)
> Mar  6 14:06:22 localhost cib: [1137]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Mar  6 14:06:22 localhost cib: [1137]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Mar  6 14:06:22 localhost cib: [1137]: info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> Mar  6 14:06:22 localhost cib: [1137]: info: main: Retrieval of a
> per-action CIB: disabled
> Mar  6 14:06:22 localhost cib: [1137]: info: readCibXmlFile: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.xml
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk] <cib generated="false" admin_epoch="0"
> have_quorum="false" ignore_dtd="false" num_peers="2"
> cib_feature_revision="1.3" epoch="24" num_updates="2"
> cib-last-written="Thu Mar  6 14:03:59 2008" ccm_transition="1">
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]   <configuration>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]     <crm_config/>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]     <nodes>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]       <node
> id="083b40b8-c48d-495c-8adf-678a6de826d7"
>  uname="lb3" type="normal"/>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]       <node
> id="7e3eabe7-6820-4ef2-8ef0-41ae7cae5cd6"
>  uname="lb2" type="normal"/>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]     </nodes>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]     <resources/>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]     <constraints/>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]   </configuration>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk]   <status/>
> Mar  6 14:06:22 localhost cib: [1137]: info: log_data_element:
> readCibXmlFile: [on-disk] </cib>
> Mar  6 14:06:22 localhost cib: [1137]: info: startCib: CIB
> Initialization completed successfully
> Mar  6 14:06:22 localhost cib: [1137]: info: cib_register_ha: Signing
> in with Heartbeat
> Mar  6 14:06:22 localhost cib: [1137]: info: cib_register_ha: FSA Hostname: 
> lb3
> Mar  6 14:06:22 localhost cib: [1137]: WARN: cib_init: CCM Activation failed
> Mar  6 14:06:22 localhost cib: [1137]: WARN: cib_init: CCM Connection
> failed 1 times (30 max)
> Mar  6 14:06:22 localhost heartbeat: [1138]: info: Starting
> "/usr/lib/heartbeat/lrmd -r" as uid 0  gid 0 (pid 1138)
> Mar  6 14:06:22 localhost lrmd: [1138]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Mar  6 14:06:22 localhost heartbeat: [1139]: info: Starting
> "/usr/lib/heartbeat/stonithd" as uid 0  gid 0 (pid 1139)
> Mar  6 14:06:22 localhost heartbeat: [1141]: info: Starting
> "/usr/lib/heartbeat/crmd" as uid 999  gid 999 (pid 1141)
> Mar  6 14:06:22 localhost crmd: [1141]: info: main: CRM Hg Version:
> feb1bb614331 tip
> Mar  6 14:06:22 localhost crmd: [1141]: WARN: Core dumps could be lost
> if multiple dumps occur.
> Mar  6 14:06:22 localhost crmd: [1141]: WARN: Consider setting
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
> maximum
> supportability
> Mar  6 14:06:22 localhost crmd: [1141]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Mar  6 14:06:22 localhost crmd: [1141]: info: crmd_init: Starting crmd
> Mar  6 14:06:22 localhost crmd: [1141]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Mar  6 14:06:22 localhost crmd: [1141]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Mar  6 14:06:22 localhost crmd: [1141]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Mar  6 14:06:22 localhost heartbeat: [1142]: info: Starting
> "/usr/lib/heartbeat/mgmtd -v" as uid 0  gid 0 (pid 1142)
> Mar  6 14:06:22 localhost mgmtd: [1142]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Mar  6 14:06:22 localhost mgmtd: [1142]: debug: Enabling coredumps
> Mar  6 14:06:22 localhost mgmtd: [1142]: WARN: Core dumps could be
> lost if multiple dumps occur.
> Mar  6 14:06:22 localhost mgmtd: [1142]: WARN: Consider setting
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
> maximum supportability
> Mar  6 14:06:22 localhost mgmtd: [1142]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Mar  6 14:06:22 localhost mgmtd: [1142]: info:
> G_main_add_SignalHandler: Added signal handler for signal 10
> Mar  6 14:06:22 localhost mgmtd: [1142]: info:
> G_main_add_SignalHandler: Added signal handler for signal 12
> Mar  6 14:06:22 localhost mgmtd: [1142]: WARN: lrm_signon: can not
> initiate connection
> Mar  6 14:06:22 localhost mgmtd: [1142]: info: login to lrm: 0, ret:0
> Mar  6 14:06:22 localhost heartbeat: [1140]: info: Starting
> "/usr/lib/heartbeat/attrd" as uid 999  gid 999 (pid 1140)
> Mar  6 14:06:22 localhost attrd: [1140]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Mar  6 14:06:22 localhost attrd: [1140]: info: register_with_ha: Hostname: lb3
> Mar  6 14:06:22 localhost ccm: [1136]: info: Hostname: lb3
> Mar  6 14:06:22 localhost lrmd: [1138]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Mar  6 14:06:22 localhost lrmd: [1138]: WARN: Core dumps could be lost
> if multiple dumps occur.
> Mar  6 14:06:22 localhost lrmd: [1138]: WARN: Consider setting
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
> maximum
> supportability
> Mar  6 14:06:22 localhost lrmd: [1138]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Mar  6 14:06:22 localhost lrmd: [1138]: info:
> G_main_add_SignalHandler: Added signal handler for signal 10
> Mar  6 14:06:22 localhost lrmd: [1138]: info:
> G_main_add_SignalHandler: Added signal handler for signal 12
> Mar  6 14:06:22 localhost lrmd: [1138]: info: Started.
> Mar  6 14:06:22 localhost stonithd: [1139]: WARN: Core dumps could be
> lost if multiple dumps occur.
> Mar  6 14:06:22 localhost stonithd: [1139]: WARN: Consider setting
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
> maximum supportability
> Mar  6 14:06:22 localhost stonithd: [1139]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
> supportability
> Mar  6 14:06:22 localhost stonithd: [1139]: info:
> G_main_add_SignalHandler: Added signal handler for signal 10
> Mar  6 14:06:22 localhost stonithd: [1139]: info:
> G_main_add_SignalHandler: Added signal handler for signal 12
> Mar  6 14:06:22 localhost stonithd: [1139]: info: Signing in with heartbeat.
> Mar  6 14:06:22 localhost attrd: [1140]: info: register_with_ha: UUID:
> 083b40b8-c48d-495c-8adf-678a6de826d7
> Mar  6 14:06:22 localhost stonithd: [1139]: notice:
> /usr/lib/heartbeat/stonithd start up successfully.
> Mar  6 14:06:22 localhost stonithd: [1139]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Mar  6 14:06:23 localhost cib: [1137]: WARN: cib_init: CCM Activation failed
> Mar  6 14:06:23 localhost cib: [1137]: WARN: cib_init: CCM Connection
> failed 2 times (30 max)
> Mar  6 14:06:23 localhost mgmtd: [1142]: info: init_crm
> Mar  6 14:06:23 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 13 is max.
> Mar  6 14:06:24 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 13 is max.
> Mar  6 14:06:24 localhost cib: [1137]: WARN: cib_init: CCM Activation failed
> Mar  6 14:06:24 localhost cib: [1137]: WARN: cib_init: CCM Connection
> failed 3 times (30 max)
> Mar  6 14:06:24 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 14 is max.
> Mar  6 14:06:24 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 14 is max.
> Mar  6 14:06:25 localhost cib: [1137]: WARN: cib_init: CCM Activation failed
> Mar  6 14:06:25 localhost cib: [1137]: WARN: cib_init: CCM Connection
> failed 4 times (30 max)
> Mar  6 14:06:25 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 15 is max.
> Mar  6 14:06:25 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 15 is max.
> Mar  6 14:06:25 localhost ccm: [1136]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Mar  6 14:06:26 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 17 is max.
> Mar  6 14:06:26 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 17 is max.
> Mar  6 14:06:26 localhost cib: [1137]: info: cib_init: Starting cib mainloop
> Mar  6 14:06:26 localhost cib: [1137]: info: cib_null_callback:
> Setting cib_refresh_notify callbacks for crmd: on
> Mar  6 14:06:26 localhost crmd: [1141]: info: do_cib_control: CIB
> connection established
> Mar  6 14:06:26 localhost cib: [1137]: info:
> cib_client_status_callback: Status update: Client lb3/cib now has
> status [join]
> Mar  6 14:06:26 localhost cib: [1137]: info:
> cib_client_status_callback: Status update: Client lb3/cib now has
> status [online]
> Mar  6 14:06:26 localhost cib: [1143]: info: write_cib_contents: Wrote
> version 0.24.2 of the CIB to disk (digest:
> 200d9f4a3ca4b67ed829ca95b4561358)
> Mar  6 14:06:26 localhost crmd: [1141]: info: register_with_ha: Hostname: lb3
> Mar  6 14:06:26 localhost cib: [1137]: info: cib_null_callback:
> Setting cib_diff_notify callbacks for mgmtd: on
> Mar  6 14:06:27 localhost crmd: [1141]: info: register_with_ha: UUID:
> 083b40b8-c48d-495c-8adf-678a6de826d7
> Mar  6 14:06:27 localhost mgmtd: [1142]: debug: main: run the loop...
> Mar  6 14:06:27 localhost mgmtd: [1142]: info: Started.
> Mar  6 14:06:27 localhost crmd: [1141]: info: populate_cib_nodes:
> Requesting the list of configured nodes
> Mar  6 14:06:28 localhost cib: [1137]: info:
> cib_client_status_callback: Status update: Client lb2/cib now has
> status [online]
> Mar  6 14:06:28 localhost crmd: [1141]: notice: populate_cib_nodes:
> Node: lb3 (uuid: 083b40b8-c48d-495c-8adf-678a6de826d7)
> Mar  6 14:06:29 localhost crmd: [1141]: notice: populate_cib_nodes:
> Node: lb2 (uuid: 7e3eabe7-6820-4ef2-8ef0-41ae7cae5cd6)
> Mar  6 14:06:29 localhost crmd: [1141]: info: do_ha_control: Connected
> to Heartbeat
> Mar  6 14:06:29 localhost crmd: [1141]: info: do_ccm_control: CCM
> connection established... waiting for first callback
> Mar  6 14:06:29 localhost crmd: [1141]: info: do_started: Delaying
> start, CCM (0000000000100000) not connected
> Mar  6 14:06:29 localhost crmd: [1141]: info: crmd_init: Starting
> crmd's mainloop
> Mar  6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
> default value '10s' for cluster option 'dc_deadtime'
> Mar  6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
> default value '0' for cluster option 'cluster_recheck_interval'
> Mar  6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
> default value '2min' for cluster option 'election_timeout'
> Mar  6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
> default value '20min' for cluster option 'shutdown_escalation'
> Mar  6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
> default value '3min' for cluster option 'crmd-integration-timeout'
> Mar  6 14:06:29 localhost crmd: [1141]: notice: cluster_option: Using
> default value '10min' for cluster option 'crmd-finalization-timeout'
> Mar  6 14:06:29 localhost crmd: [1141]: notice:
> crmd_client_status_callback: Status update: Client lb3/crmd now has
> status [online]
> Mar  6 14:06:29 localhost crmd: [1141]: notice:
> crmd_client_status_callback: Status update: Client lb3/crmd now has
> status [online]
> Mar  6 14:06:29 localhost crmd: [1141]: info: do_started: Delaying
> start, CCM (0000000000100000) not connected
> Mar  6 14:06:30 localhost crmd: [1141]: notice:
> crmd_client_status_callback: Status update: Client lb2/crmd now has
> status [online]
> Mar  6 14:06:30 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 27 is max.
> Mar  6 14:06:30 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 27 is max.
> Mar  6 14:06:30 localhost crmd: [1141]: info: do_started: Delaying
> start, CCM (0000000000100000) not connected
> Mar  6 14:06:31 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 28 is max.
> Mar  6 14:06:31 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 28 is max.
> Mar  6 14:06:31 localhost attrd: [1140]: info: main: Starting mainloop...
> Mar  6 14:06:32 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 28 is max.
> Mar  6 14:06:32 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 29 is max.
> Mar  6 14:06:33 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 29 is max.
> Mar  6 14:06:33 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 29 is max.
> Mar  6 14:06:34 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 31 is max.
> Mar  6 14:06:34 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 31 is max.
> Mar  6 14:06:35 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 31 is max.
> Mar  6 14:06:35 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 32 is max.
> Mar  6 14:06:36 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 32 is max.
> Mar  6 14:06:36 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 32 is max.
> Mar  6 14:06:37 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 33 is max.
> Mar  6 14:06:37 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 33 is max.
> Mar  6 14:06:38 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 35 is max.
> Mar  6 14:06:38 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 36 is max.
> Mar  6 14:06:39 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 36 is max.
> Mar  6 14:06:39 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668083 requested. 36 is max.
> Mar  6 14:06:40 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 37 is max.
> Mar  6 14:06:40 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668072 requested. 37 is max.
> Mar  6 14:06:41 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 38 is max.
> Mar  6 14:06:41 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668078 requested. 38 is max.
> Mar  6 14:06:42 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 40 is max.
> Mar  6 14:06:42 localhost heartbeat: [1126]: WARN: Rexmit of seq
> 668082 requested. 40 is max.
> 
> And so on, until I shut it down...
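
As an aside, the doubled "t=NS_rexmit" in the message dump above
suggests the serial link itself is garbling data. A quick raw test of
the link (a generic serial check, nothing heartbeat-specific; stop
heartbeat on both nodes first):

  # on lb3: listen on the port
  cat < /dev/ttyS0

  # on lb2: send a test string
  echo hello > /dev/ttyS0

If "hello" doesn't arrive intact on lb3, the cable or the port
settings are at fault.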
> 
> One note: I deleted all the resources from the slave server (cibadmin
> -E), because when I started it yesterday (strangely, heartbeat wasn't
> running) it was disconnected from the other node, and it brought up
> the same virtual IPs that the other node was already running.
> 
> I don't have logs older than 5 days or so... and I didn't find
> anything useful in what I have...
> 
> I'm running this on Debian etch with packages I compiled myself
> (without special options...). The heartbeat package is
> heartbeat-2.1.2, and these are my config files:
> 
> ha.cf :
> bcast eth3
> serial /dev/ttyS0

Which baud rate do you have set for the serial link? It should be at
least 115200. Raising it may well help, since this looks like a
communications problem.
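
For reference, the startup log above shows the serial link running at
19200 baud. A minimal, untested sketch of the relevant ha.cf lines
(the rate has to match on both nodes):

  # /etc/ha.d/ha.cf -- same on lb2 and lb3
  baud   115200
  serial /dev/ttyS0

Heartbeat needs a restart on both nodes to pick up the new rate.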

Thanks,

Dejan

> udpport 694
> deadtime 10
> node lb2 lb3
> # commented out until we know how to use it
> # use_logd yes
> crm yes
> -----
> lb.xml:
> <cib>
>   <configuration>
>     <crm_config/>
>     <nodes/>
>     <resources>
>         <group id="lb_group-web">
>         <primitive id="rsc_ip-ipexterna" class="ocf" type="IPaddr2"
> provider="heartbeat">
>                 <instance_attributes>
>                         <attributes>
>                                 <nvpair id="127" name="ip" 
> value="10.60.5.11"/>
>                                 <nvpair id="128" name="netmask" value="22"/>
>                                 <nvpair id="129" name="nic" value="eth1"/>
>                         </attributes>
>                 </instance_attributes>
>         </primitive>
>         <primitive id="rsc_ip-web" class="ocf" type="IPaddr2"
> provider="heartbeat">
>                 <instance_attributes>
>                         <attributes>
>                                 <nvpair id="109" name="ip"
> value="192.168.201.1"/>
>                                 <nvpair id="110" name="netmask" value="24"/>
>                                 <nvpair id="111" name="nic" value="eth2"/>
>                         </attributes>
>                 </instance_attributes>
>         </primitive>
>         <primitive id="rsc_ip-db" class="ocf" type="IPaddr2"
> provider="heartbeat">
>                 <instance_attributes>
>                         <attributes>
>                                 <nvpair id="123" name="ip"
> value="192.168.206.1"/>
>                                 <nvpair id="124" name="netmask" value="24"/>
>                                 <nvpair id="125" name="nic" value="eth0"/>
>                         </attributes>
>                 </instance_attributes>
>         </primitive>
>         <primitive id="rsc_ldirector" class="lsb" type="ldirectord" />
>         </group>
>     </resources>
>     <constraints>
>                 <rsc_location id="run_group-web" rsc="lb_group-web">
>                 <rule id="pref_run_lg_group-web" score="100">
>                         <expression attribute="#uname" operation="eq"
> value="lb2"/>
>                 </rule>
>                 </rsc_location>
>     </constraints>
>   </configuration>
>   <status/>
> </cib>
> 
> 
> and we load this config with:
> 
> #!/bin/sh
> CONF=/etc/ha.d/lb.xml
> # erase the whole live CIB, then create it fresh from the XML file
> cibadmin -E
> cibadmin -C -x $CONF
> 
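Note that cibadmin -E erases the contents of the whole live CIB, the
status section included, not just the resources. To see what the
cluster actually holds after a load, you can dump the running CIB with
the standard query option and compare it with lb.xml:

  cibadmin -Q > /tmp/live-cib.xml
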
> 
> And authkeys is the same on both nodes:
> 
> -rw------- 1 root root  623 2008-01-30 19:11 authkeys
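
(For completeness: authkeys only needs the same method and secret on
both nodes, along these lines, with "somesecret" as a placeholder:

  auth 1
  1 sha1 somesecret

plus the 0600 permissions you already have.)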
> 
> 
> 
> I really have no idea where to start looking for the cause of all
> these problems... (besides my co-worker... :D )
> 
> Any advice is welcome...
> 
> 
> Thanks!
> 
> -- 
> Roberto Scattini
>  ___     _
>  ))_) __ )L __
> ((__)(('(( ((_)
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems