So, I've been following the instructions in the FAQ for setting a suitable deadtime and have noticed one thing that seems a little odd to me - it seems that whenever a node reboots and rejoins the cluster, heartbeat on the other node sees the new generation number and I then often see a late heartbeat message on BOTH nodes where the time is between 3 and 4s.
I am running 2.1.0 with keepalive=1s and warntime=3s and have attached some log extracts from the relevant times. Any ideas why restarting hb should result in a 3+s delay like this? Am I being too picky to expect this not to happen? Thanks, Simon On the node that does not reboot, I see output like this in the log file when the other node restarts: Jun 23 04:50:32 lisa heartbeat: [7887]: info: Heartbeat restart on node bart.sn.stratus.com Jun 23 04:50:32 lisa heartbeat: [7887]: info: Link bart.sn.stratus.com:eth0 up. Jun 23 04:50:32 lisa heartbeat: [7887]: info: Status update for node bart.sn.stratus.com: status init Jun 23 04:50:32 lisa heartbeat: [7887]: info: Link bart.sn.stratus.com:eth1 up. Jun 23 04:50:32 lisa heartbeat: [7887]: info: Status update for node bart.sn.stratus.com: status up Jun 23 04:50:32 lisa heartbeat: [7887]: debug: get_delnodelist: delnodelist= Jun 23 04:50:32 lisa heartbeat: [7887]: info: all clients are now paused Jun 23 04:50:32 lisa heartbeat: [7887]: debug: hist->ackseq =8523 Jun 23 04:50:32 lisa heartbeat: [7887]: debug: hist->lowseq =8522, hist->hiseq=8624 Jun 23 04:50:32 lisa heartbeat: [7887]: debug: Jun 23 04:50:32 lisa crmd: [7962]: notice: crmd_ha_status_callback: Status update: Node bart.sn.stratus.com now has status [init] Jun 23 04:50:32 lisa crmd: [7962]: info: crmd_ha_status_callback: Ping node bart.sn.stratus.com is init Jun 23 04:50:32 lisa crmd: [7962]: notice: crmd_ha_status_callback: Status update: Node bart.sn.stratus.com now has status [up] Jun 23 04:50:32 lisa crmd: [7962]: info: crmd_ha_status_callback: Ping node bart.sn.stratus.com is up Jun 23 04:50:32 lisa heartbeat: [7887]: debug: hist->ackseq =8523 Jun 23 04:50:32 lisa heartbeat: [7887]: debug: hist->lowseq =8522, hist->hiseq=8625 Jun 23 04:50:32 lisa heartbeat: [7887]: debug: expecting from bart.sn.stratus.com Jun 23 04:50:32 lisa heartbeat: [7887]: debug: it's ackseq=0 Jun 23 04:50:32 lisa heartbeat: [7887]: debug: Jun 23 04:50:32 lisa heartbeat: [7887]: info: all clients are now resumed Jun 23 04:50:35 lisa heartbeat: [7887]: WARN: Late heartbeat: Node bart.sn.stratus.com: interval 3020 ms Jun 23 04:50:35 lisa heartbeat: [7887]: info: Status update for node bart.sn.stratus.com: status active Jun 23 04:50:35 lisa crmd: [7962]: notice: crmd_ha_status_callback: Status update: Node bart.sn.stratus.com now has status [active] Jun 23 04:50:35 lisa cib: [7958]: info: cib_diff_notify: Local-only Change (client:7962, call: 51): 0.2.68 (ok) Jun 23 04:50:35 lisa tengine: [10395]: info: te_update_diff: Processing diff (cib_update): 0.2.68 -> 0.2.68 Jun 23 04:50:35 lisa cib: [12691]: info: write_cib_contents: Wrote version 0.2.68 of the CIB to disk (digest: 9c9b0cce1f2884c4e6b1f98163dd8c2e) Jun 23 04:50:36 lisa cib: [7958]: info: cib_client_status_callback: Status update: Client bart.sn.stratus.com/cib now has status [join] Jun 23 04:50:37 lisa heartbeat: [7887]: WARN: 1 lost packet(s) for [bart.sn.stratus.com] [14:16] Jun 23 04:50:37 lisa heartbeat: [7887]: info: No pkts missing from bart.sn.stratus.com! Jun 23 04:50:37 lisa crmd: [7962]: notice: crmd_client_status_callback: Status update: Client bart.sn.stratus.com/crmd now has status [online] Jun 23 04:50:37 lisa crmd: [7962]: info: crmd_client_status_callback: Uncaching UUID for bart.sn.stratus.com Meanwhile, on the node that just rebooted, I see: Jun 23 04:50:30 bart heartbeat: [7977]: info: Enabling logging daemon Jun 23 04:50:30 bart heartbeat: [7977]: info: logfile and debug file are those specified in logd config file (default /etc/logd.cf) Jun 23 04:50:30 bart heartbeat: [7977]: info: Version 2 support: true Jun 23 04:50:31 bart heartbeat: [7977]: info: ************************** Jun 23 04:50:31 bart heartbeat: [7977]: info: Configuration validated. Starting heartbeat 2.0.9 Jun 23 04:50:31 bart heartbeat: [7982]: info: heartbeat: version 2.0.9 Jun 23 04:50:31 bart heartbeat: [7982]: info: Heartbeat generation: 1182588631 Jun 23 04:50:31 bart heartbeat: [7982]: info: G_main_add_TriggerHandler: Added signal manual handler Jun 23 04:50:31 bart heartbeat: [7982]: info: G_main_add_TriggerHandler: Added signal manual handler Jun 23 04:50:31 bart heartbeat: [7982]: info: Removing /var/run/heartbeat/rsctmp failed, recreating. Jun 23 04:50:31 bart heartbeat: [7982]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 Jun 23 04:50:31 bart heartbeat: [7982]: info: glib: ucast: bound send socket to device: eth0 Jun 23 04:50:31 bart heartbeat: [7982]: info: glib: ucast: bound receive socket to device: eth0 Jun 23 04:50:31 bart heartbeat: [7982]: info: glib: ucast: started on port 694 interface eth0 to 134.111.30.58 Jun 23 04:50:31 bart heartbeat: [7982]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 Jun 23 04:50:31 bart heartbeat: [7982]: info: glib: ucast: bound send socket to device: eth1 Jun 23 04:50:32 bart heartbeat: [7982]: info: glib: ucast: bound receive socket to device: eth1 Jun 23 04:50:32 bart heartbeat: [7982]: info: glib: ucast: started on port 694 interface eth1 to 192.168.1.101 Jun 23 04:50:32 bart heartbeat: [7982]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Jun 23 04:50:33 bart heartbeat: [7982]: info: Local status now set to: 'up' Jun 23 04:50:33 bart heartbeat: [7982]: info: Link lisa.sn.stratus.com:eth0 up. Jun 23 04:50:33 bart heartbeat: [7982]: info: Status update for node lisa.sn.stratus.com: status active Jun 23 04:50:34 bart heartbeat: [7982]: info: Link lisa.sn.stratus.com:eth1 up. Jun 23 04:50:35 bart heartbeat: [7982]: info: Comm_now_up(): updating status to active Jun 23 04:50:36 bart heartbeat: [7982]: WARN: Late heartbeat: Node bart.sn.stratus.com: interval 3020 ms Jun 23 04:50:36 bart heartbeat: [7982]: info: Local status now set to: 'active' _______________________________________________________ Linux-HA-Dev: [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
