HI all,
Some days ago we had an issue with heartbeat. It started when the active
node was rebooted.
The cluster contains DRBD, an IP and some services.
There is a bonded interfaces on both hosts and a serial cable between
the 2 hosts.
When the active machine was rebooted the resources are taken by the
other host.
At the time the rebooted machine was alive again heartbeat started to
log a huge amount of messages.
Does anyone knows what this has caused and how it can be solved.
At this moment I can't reproduce the problem.
Any help would be appreciated!
Thx,
Johan Huysmans
The full log can be found on:
http://www.raskas.be/heartbeat/heartbeat-issue.log.bz2
We are using heartbeat v2 with v1 config. This is the config:
logfacility local0
keepalive 2
deadtime 10
warntime 5
initdead 120
baud 19200
serial /dev/ttyUSB0
auto_failback off
udpport 696
ucast bond0 172.16.32.101
ucast bond0.2 172.16.40.101
node HOST-A
node HOST-B
hbgenmethod time
Here are some extracts from that log.
Feb 15 08:34:13 HOST-A heartbeat: [3138]: info: Local status now set to:
'up'
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Link HOST-B:/dev/ttyUSB0
up.
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Status update for node
HOST-B: status active
Feb 15 08:34:14 HOST-A harc[3174]: info: Running /etc/ha.d/rc.d/status
status
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Heartbeat restart on
node HOST-A
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Link HOST-A:/dev/ttyUSB0 up.
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Status update for node
HOST-A: status init
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Link HOST-A:bond0 up.
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Link HOST-A:bond0.2 up.
Feb 15 08:34:14 HOST-B heartbeat: [3077]: info: Status update for node
HOST-A: status up
Feb 15 08:34:14 HOST-B harc[20414]: info: Running /etc/ha.d/rc.d/status
status
Feb 15 08:34:14 HOST-B harc[20427]: info: Running /etc/ha.d/rc.d/status
status
Feb 15 08:34:14 HOST-A heartbeat: [3138]: WARN: 298 lost packet(s) for
[HOST-B] [137421:137720]
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Link HOST-B:bond0 up.
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Link HOST-B:bond0.2 up.
Feb 15 08:34:14 HOST-A heartbeat: [3138]: info: Comm_now_up(): updating
status to active
Feb 15 08:34:15 HOST-A heartbeat: [3138]: info: Local status now set to:
'active'
Feb 15 08:34:15 HOST-A heartbeat: [3138]: WARN: G_CH_dispatch_int:
Dispatch function for read child took too long to execute: 320 ms (> 50
ms) (GSource: 0x9938250)
Feb 15 08:34:15 HOST-A heartbeat: [3138]: ERROR: Cannot write to media
pipe 0: Resource temporarily unavailable
Feb 15 08:34:15 HOST-A heartbeat: [3138]: ERROR: Shutting down.
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: Status update for node
HOST-A: status active
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: all clients are now paused
Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137422 for HOST-A: seqno too low
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722,
lowseq=137620,ackseq=137621,lastmsg=121
Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137422 for HOST-A: seqno too low
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722,
lowseq=137620,ackseq=137621,lastmsg=121
Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137422 for HOST-A: seqno too low
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722,
lowseq=137620,ackseq=137621,lastmsg=121
Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137423 for HOST-A: seqno too low
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:15 HOST-A heartbeat: [3138]: ERROR: Cannot write to media
pipe 0: Resource temporarily unavailable
Feb 15 08:34:15 HOST-B heartbeat: [3077]: info: hiseq =137722,
lowseq=137620,ackseq=137621,lastmsg=121
Feb 15 08:34:15 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137423 for HOST-A: seqno too low
Feb 15 08:34:16 HOST-A heartbeat: [3138]: ERROR: Cannot write to media
pipe 0: Resource temporarily unavailable
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hiseq =137723,
lowseq=137620,ackseq=137621,lastmsg=122
Feb 15 08:34:16 HOST-A heartbeat: [3138]: ERROR: Shutting down.
Feb 15 08:34:16 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137431 for HOST-A: seqno too low
Feb 15 08:34:16 HOST-A heartbeat: [3138]: WARN: Shutdown delayed until
current resource activity finishes.
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch:
Dispatch function for retransmit request took too long to execute: 1240
ms (> 1000 ms) (GSource: 0x995dd30)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch:
Dispatch function for retransmit request was delayed 1310 ms (> 1000 ms)
before being called (GSource: 0x995ddc8)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hiseq =137723,
lowseq=137620,ackseq=137621,lastmsg=122
Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch:
started at 429408104 should have started at 429407973
Feb 15 08:34:16 HOST-B heartbeat: [3077]: WARN: G_CH_dispatch_int:
Dispatch function for read child took too long to execute: 210 ms (> 50
ms) (GSource: 0x910a008)
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch:
Dispatch function for retransmit request was delayed 1310 ms (> 1000 ms)
before being called (GSource: 0x995de60)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137431 for HOST-A: seqno too low
Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch:
started at 429408104 should have started at 429407973
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch:
Dispatch function for retransmit request was delayed 1330 ms (> 1000 ms)
before being called (GSource: 0x995def8)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch:
started at 429408106 should have started at 429407973
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hiseq =137723,
lowseq=137620,ackseq=137621,lastmsg=122
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch:
Dispatch function for retransmit request was delayed 1370 ms (> 1000 ms)
before being called (GSource: 0x995df90)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: WARN: G_CH_dispatch_int:
Dispatch function for read child took too long to execute: 220 ms (> 50
ms) (GSource: 0x910a138)
Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch:
started at 429408110 should have started at 429407973
Feb 15 08:34:16 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137432 for HOST-A: seqno too low
Feb 15 08:34:17 HOST-A heartbeat: [3138]: WARN: Gmain_timeout_dispatch:
Dispatch function for retransmit request was delayed 1440 ms (> 1000 ms)
before being called (GSource: 0x995e028)
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: fromnode =HOST-A,
fromnode's ackseq = 0
Feb 15 08:34:17 HOST-A heartbeat: [3138]: info: Gmain_timeout_dispatch:
started at 429408117 should have started at 429407973
Feb 15 08:34:16 HOST-B heartbeat: [3077]: info: hist information:
Feb 15 08:34:17 HOST-B heartbeat: [3077]: info: hiseq =137723,
lowseq=137620,ackseq=137621,lastmsg=122
Feb 15 08:34:17 HOST-B heartbeat: [3077]: ERROR: Cannot rexmit pkt
137432 for HOST-A: seqno too low
Feb 15 08:34:51 HOST-B heartbeat: [3077]: ERROR: Message hist queue is
filling up (200 messages in queue)
Feb 15 08:34:51 HOST-B heartbeat: [3077]: ERROR: Cannot write to media
pipe 0: Resource temporarily unavailable
Feb 15 08:34:51 HOST-B heartbeat: [3077]: ERROR: Shutting down.
Feb 15 08:34:52 HOST-B heartbeat: [3077]: ERROR: Message hist queue is
filling up (200 messages in queue)
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost
packet: node HOST-B seq 137430
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost
packet: node HOST-B seq 137437
Feb 15 08:34:52 HOST-A last message repeated 3 times
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost
packet: node HOST-B seq 137430
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost
packet: node HOST-B seq 137437
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost
packet: node HOST-B seq 137437
Feb 15 08:34:52 HOST-A heartbeat: [3138]: ERROR: Irretrievably lost
packet: node HOST-B seq 137430
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems