HI all,

Some days ago we had an issue with heartbeat. It started when the active node was rebooted.

The cluster contains DRBD, an IP and some services.
There is a bonded interfaces on both hosts and a serial cable between the 2 hosts.

When the active machine was rebooted the resources are taken by the other host. At the time the rebooted machine was alive again heartbeat started to log a huge amount of messages.


Does anyone knows what this has caused and how it can be solved.
At this moment I can't reproduce the problem.

Any help would be appreciated!
Thx,
Johan Huysmans



The full log can be found on: http://www.raskas.be/heartbeat/heartbeat-issue.log.bz2

We are using heartbeat v2 with v1 config. This is the config:
logfacility     local0
keepalive 2
deadtime 10
warntime 5
initdead 120

baud 19200
serial /dev/ttyUSB0

auto_failback off
udpport 696

ucast  bond0 172.16.32.101
ucast  bond0.2 172.16.40.101

node HOST-A
node HOST-B

hbgenmethod time


Here are some extracts from that log.

Feb 15 08:34:13 HOST-A heartbeat: [3138]: info: Local status now set to: 'up' Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Link HOST-B:/dev/ttyUSB0 up. Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Status update for node HOST-B: status active Feb 15 08:34:14 HOST-A harc[3174]: info: Running /etc/ha.d/rc.d/status status Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Heartbeat restart on node HOST-A
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Link HOST-A:/dev/ttyUSB0 up.
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Status update for node HOST-A: status init
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Link HOST-A:bond0 up.
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Link HOST-A:bond0.2 up.
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Status update for node HOST-A: status up Feb 15 08:34:14 HOST-B harc[20414]: info: Running /etc/ha.d/rc.d/status status Feb 15 08:34:14 HOST-B harc[20427]: info: Running /etc/ha.d/rc.d/status status Feb 15 08:34:14 HOST-A heartbeat: [3138]: WARN: 298 lost packet(s) for [HOST-B] [137421:137720]
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Link HOST-B:bond0 up.
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Link HOST-B:bond0.2 up.
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Comm_now_up(): updating status to active Feb 15 08:34:15 HOST-A heartbeat: [3138]: info: Local status now set to: 'active' Feb 15 08:34:15 HOST-A heartbeat: [3138]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 320 ms (> 50 ms) (GSource: 0x9938250)

Feb 15 08:34:15 HOST-A heartbeat: [3138]: ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
Feb 15 08:34:15 HOST-A heartbeat: [3138]: ERROR: Shutting down.
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: Status update for node HOST-A: status active
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: all clients are now paused
Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137422 for HOST-A: seqno too low Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722, lowseq=137620,ackseq=137621,lastmsg=121 Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137422 for HOST-A: seqno too low Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722, lowseq=137620,ackseq=137621,lastmsg=121 Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137422 for HOST-A: seqno too low Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722, lowseq=137620,ackseq=137621,lastmsg=121 Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137423 for HOST-A: seqno too low Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-A heartbeat: [3138]: ERROR: Cannot write to media pipe 0: Resource temporarily unavailable Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722, lowseq=137620,ackseq=137621,lastmsg=121 Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137423 for HOST-A: seqno too low

Feb 15 08:34:16 HOST-A heartbeat: [3138]: ERROR: Cannot write to media pipe 0: Resource temporarily unavailable Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hiseq =137723, lowseq=137620,ackseq=137621,lastmsg=122
Feb 15 08:34:16 HOST-A heartbeat: [3138]: ERROR: Shutting down.
Feb 15 08:34:16 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137431 for HOST-A: seqno too low Feb 15 08:34:16 HOST-A heartbeat: [3138]: WARN: Shutdown delayed until current resource activity finishes. Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0 Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 1240 ms (> 1000 ms) (GSource: 0x995dd30)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request was delayed 1310 ms (> 1000 ms) before being called (GSource: 0x995ddc8) Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hiseq =137723, lowseq=137620,ackseq=137621,lastmsg=122 Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch: started at 429408104 should have started at 429407973 Feb 15 08:34:16 HOST-B heartbeat: [3077]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 210 ms (> 50 ms) (GSource: 0x910a008) Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request was delayed 1310 ms (> 1000 ms) before being called (GSource: 0x995de60) Feb 15 08:34:16 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137431 for HOST-A: seqno too low Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch: started at 429408104 should have started at 429407973 Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0 Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request was delayed 1330 ms (> 1000 ms) before being called (GSource: 0x995def8)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch: started at 429408106 should have started at 429407973 Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hiseq =137723, lowseq=137620,ackseq=137621,lastmsg=122 Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request was delayed 1370 ms (> 1000 ms) before being called (GSource: 0x995df90) Feb 15 08:34:16 HOST-B heartbeat: [3077]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 220 ms (> 50 ms) (GSource: 0x910a138) Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch: started at 429408110 should have started at 429407973 Feb 15 08:34:16 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137432 for HOST-A: seqno too low Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request was delayed 1440 ms (> 1000 ms) before being called (GSource: 0x995e028) Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: fromnode =HOST-A, fromnode's ackseq = 0 Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch: started at 429408117 should have started at 429407973
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:17 HOST-B heartbeat: [3077]: info: hiseq =137723, lowseq=137620,ackseq=137621,lastmsg=122 Feb 15 08:34:17 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt 137432 for HOST-A: seqno too low

Feb 15 08:34:51 HOST-B heartbeat: [3077]: ERROR: Message hist queue is filling up (200 messages in queue) Feb 15 08:34:51 HOST-B heartbeat: [3077]: ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
Feb 15 08:34:51 HOST-B heartbeat: [3077]: ERROR: Shutting down.
Feb 15 08:34:52 HOST-B heartbeat: [3077]: ERROR: Message hist queue is filling up (200 messages in queue) Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost packet: node HOST-B seq 137430 Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost packet: node HOST-B seq 137437
Feb 15 08:34:52 HOST-A last message repeated 3 times
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost packet: node HOST-B seq 137430 Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost packet: node HOST-B seq 137437 Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost packet: node HOST-B seq 137437 Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost packet: node HOST-B seq 137430

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to