Heartbeat has been pegging the CPU on the primary DRBD cluster for hours
now. I see some timeout errors in the logs, but nothing else to indicate why
the heartbeat process is consuming so many CPU cycles. Its memory size is also
significantly larger than on similar systems, where heartbeat usually sits at
about 13 MB and uses only 0.4% CPU.

Can anyone share some tips on where I might look for a probable cause? I've
included as much detail as possible on the current setup below.

SNSfile01:/var/log # top -b -d 2 -n 2 -p 3358
top - 16:00:33 up 2 days,  7:46,  2 users,  load average: 7.59, 7.54, 7.53
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 30.2%us,  0.3%sy,  0.0%ni, 68.8%id,  0.5%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   3962268k total,  1287412k used,  2674856k free,   141692k buffers
Swap:  4192956k total,        0k used,  4192956k free,   982732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3358 root      -2   0 99.0m  98m 5384 R 95.9  2.6   1027:44 heartbeat


top - 16:00:35 up 2 days,  7:46,  2 users,  load average: 8.02, 7.63, 7.56
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s): 93.1%us,  0.5%sy,  0.0%ni,  6.0%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3962268k total,  1287412k used,  2674856k free,   141692k buffers
Swap:  4192956k total,        0k used,  4192956k free,   982732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3358 root      -2   0 99.0m  98m 5384 S 93.1  2.6   1027:46 heartbeat

gstack 3358
#0  0xb7dc99cb in g_main_context_prepare () from /usr/lib/libglib-2.0.so.0
#1  0xb7dc9dca in ?? () from /usr/lib/libglib-2.0.so.0
#2  0xb7dca602 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#3  0x08056ae8 in ?? ()
#4  0x0805a247 in main ()

SNSfile01:/var/log # ps aux|grep -i heart
root      3358 30.9  2.5 101888 101884 ?       RLs  Aug28 1038:38 heartbeat: master control process
nobody    3367  0.0  0.1   6720  6716 ?        SL   Aug28   0:04 heartbeat: FIFO reader
nobody    3368  0.0  0.1   6716  6712 ?        RL   Aug28   1:41 heartbeat: write: bcast eth3
nobody    3369  0.0  0.1   6716  6712 ?        SL   Aug28   0:24 heartbeat: read: bcast eth3
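
As a rough cross-check of the figures above, the accumulated CPU time can be compared with the uptime. This is only a sketch: the numbers are copied by hand from the second top sample (TIME+ 1027:46, up 2 days 7:46), and the function name cpu_share is just illustrative.

```shell
# Share of uptime that a process has spent on the CPU, in percent.
# $1 = CPU minutes, $2 = CPU seconds (from top's TIME+), $3 = uptime in minutes
cpu_share() {
    awk -v m="$1" -v s="$2" -v up="$3" \
        'BEGIN { printf "%.1f\n", (m + s / 60) / up * 100 }'
}

# "up 2 days, 7:46" converted to minutes
uptime_min=$(( 2 * 24 * 60 + 7 * 60 + 46 ))

# TIME+ of 1027:46 for PID 3358
cpu_share 1027 46 "$uptime_min"
```

With these inputs it prints 30.7, which is close to the 30.9% that ps reports and to the 30.1% total user time in procinfo. If I'm reading that right, heartbeat accounts for nearly all user CPU time since boot, about 17 hours of it.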


SNSfile01:/var/log # more /etc/ha.d/ha.cf /etc/ha.d/haresources
::::::::::::::
/etc/ha.d/ha.cf
::::::::::::::
logfile /var/log/ha-log
debugfile /var/log/ha-debug
bcast   eth3
udpport 694
warntime 8
deadtime 30
initdead 120
keepalive 2
auto_failback on
node SNSfile01
node SNSfile02
::::::::::::::
/etc/ha.d/haresources
::::::::::::::
SNSfile01 IPaddr::10.10.1.180/24 drbddisk::r0 Filesystem::/dev/drbd0::/wwwroot::reiserfs nfsserver smb nmb

SNSfile01:/var/log # procinfo
Linux 2.6.27.7-9-pae (ge...@buildhost) (gcc 4.3.2) #1 SMP 2008-12-04 18:10:04 +0100 1CPU [SNSfile01.]

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       3962268     1287900     2674368           0      141696      996976
Swap:      4192956           0     4192956

Bootup: Sat Aug 28 08:14:18 2010    Load average: 8.90 7.90 7.65 11/167 23585

user  :      16:48:33.65  30.1%  page in :     608752  disk 1:    12118r  347678w
nice  :       0:00:19.51   0.0%  page out:    7255902  disk 2:    29748r  240080w
system:       0:10:29.42   0.3%  page act:     136932
IOwait:       0:17:25.57   0.5%  page dea:          0
hw irq:       0:00:39.38   0.0%  page flt:   43102070
sw irq:       0:03:29.61   0.1%  swap in :          0
idle  :   1d 14:18:20.75  68.7%  swap out:          0
uptime:   2d  7:47:14.95         context :  118992651

irq  0:        75 timer                 irq 12:        92 i8042
irq  1:         8 i8042                 irq 14:    175143 ata_piix
irq  3:         1                       irq 15:         0 ata_piix
irq  4:         1                       irq 16:         0 vmci
irq  6:         5 floppy [2]            irq 17:    429639 ioc0
irq  7:         0 parport0              irq 18:  74429101 vmxnet ether
irq  8:         0 rtc0                  irq 19:   1158878 vmxnet ether
irq  9:         0 acpi

SNSfile01:/var/log # tail  ha-log ha-debug
==> ha-log <==
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5748)
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf57b0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0xddf5a98)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b00)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b68)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5bd0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5c38)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0xddf5ca0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 60 ms (> 10 ms) (GSource: 0xddf5d08)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5d70)

==> ha-debug <==
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5748)
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf57b0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0xddf5a98)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b00)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b68)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5bd0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5c38)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0xddf5ca0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 60 ms (> 10 ms) (GSource: 0xddf5d08)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5d70)
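
To get a feel for how often these warnings fire and how slow the dispatches are, the log can be summarised with a small awk wrapper. This is a minimal sketch, assuming every entry is a single line in the log file and has the "Dispatch function for NAME took too long to execute: N ms" shape shown above. The function name summarize_slow_dispatch is made up for illustration.

```shell
# Summarise heartbeat's "took too long" warnings per dispatch function:
# number of warnings, average duration and maximum duration in ms.
# Usage: summarize_slow_dispatch /var/log/ha-log
summarize_slow_dispatch() {
    awk '
        /took too long to execute:/ {
            # Function name sits between "function for " and " took too long".
            name = $0
            sub(/.*Dispatch function for /, "", name)
            sub(/ took too long.*/, "", name)

            # Duration is the number right after "execute: ".
            ms = $0
            sub(/.*execute: /, "", ms)
            sub(/ ms.*/, "", ms)

            count[name]++
            total[name] += ms
            if (ms + 0 > max[name]) max[name] = ms + 0
        }
        END {
            for (n in count)
                printf "%s: count=%d avg_ms=%.1f max_ms=%d\n",
                       n, count[n], total[n] / count[n], max[n]
        }
    ' "$@"
}
```

Run against only the ten entries shown here, it should report a single line for "retransmit request" with count=10, an average of 73.0 ms and a maximum of 100 ms. Over the full log it may show whether other dispatch functions are affected as well.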

SNSfile01:/var/log # zypper info heartbeat
Loading repository data...
Reading installed packages...

Information for package heartbeat:

Repository: @System
Name: heartbeat
Version: 2.99.3-1.6
Arch: i586
Vendor: openSUSE
Installed: Yes
Status: up-to-date
Installed Size: 1.0 M
Summary: The Heartbeat Subsystem for High-Availability Linux
Description:
heartbeat is a sophisticated multinode resource manager for High
Availability clusters.

It can failover arbitrary resources, ranging from IP addresses over NFS
to databases that are tied in via resource scripts. The resources can
have arbitrary dependencies for ordering or placement between them.

heartbeat contains a cluster membership layer, fencing, and local and
clusterwide resource management functionality.

1.2/1.0 based 2-node only configurations are supported in a legacy
mode.

heartbeat implements the following kinds of heartbeats:

- Serial ports

- UDP/IPv4 broadcast, multi-cast, and unicast

- IPv4 "ping" pseudo-cluster members.