Heartbeat has been pegging the CPU on the primary DRBD cluster for hours now. I see some timeout errors in the logs, but nothing else to indicate why the heartbeat process is consuming so many CPU cycles. Its memory size is also significantly larger than on similar systems, where heartbeat usually sits at about 13 MB and uses only 0.4% CPU.
Can anyone share some tips as to where I might look for probable cause? I'm sharing as much detail as possible on the current setup.

SNSfile01:/var/log # top -b -d 2 -n 2 -p 3358
top - 16:00:33 up 2 days,  7:46,  2 users,  load average: 7.59, 7.54, 7.53
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 30.2%us,  0.3%sy,  0.0%ni, 68.8%id,  0.5%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   3962268k total,  1287412k used,  2674856k free,   141692k buffers
Swap:  4192956k total,        0k used,  4192956k free,   982732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3358 root      -2   0 99.0m  98m 5384 R 95.9  2.6   1027:44 heartbeat

top - 16:00:35 up 2 days,  7:46,  2 users,  load average: 8.02, 7.63, 7.56
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s): 93.1%us,  0.5%sy,  0.0%ni,  6.0%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3962268k total,  1287412k used,  2674856k free,   141692k buffers
Swap:  4192956k total,        0k used,  4192956k free,   982732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3358 root      -2   0 99.0m  98m 5384 S 93.1  2.6   1027:46 heartbeat

gstack 3358
#0  0xb7dc99cb in g_main_context_prepare () from /usr/lib/libglib-2.0.so.0
#1  0xb7dc9dca in ?? () from /usr/lib/libglib-2.0.so.0
#2  0xb7dca602 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#3  0x08056ae8 in ?? ()
#4  0x0805a247 in main ()

SNSfile01:/var/log # ps aux|grep -i heart
root      3358 30.9  2.5 101888 101884 ?  RLs  Aug28 1038:38 heartbeat: master control process
nobody    3367  0.0  0.1   6720   6716 ?  SL   Aug28    0:04 heartbeat: FIFO reader
nobody    3368  0.0  0.1   6716   6712 ?  RL   Aug28    1:41 heartbeat: write: bcast eth3
nobody    3369  0.0  0.1   6716   6712 ?  SL   Aug28    0:24 heartbeat: read: bcast eth3

SNSfile01:/var/log # more /etc/ha.d/ha.cf /etc/ha.d/haresources
::::::::::::::
/etc/ha.d/ha.cf
::::::::::::::
logfile /var/log/ha-log
debugfile /var/log/ha-debug
bcast eth3
udpport 694
warntime 8
deadtime 30
initdead 120
keepalive 2
auto_failback on
node SNSfile01
node SNSfile02
::::::::::::::
/etc/ha.d/haresources
::::::::::::::
SNSfile01 IPaddr::10.10.1.180/24 drbddisk::r0 Filesystem::/dev/drbd0::/wwwroot::reiserfs nfsserver smb n mb

SNSfile01:/var/log # procinfo
Linux 2.6.27.7-9-pae (ge...@buildhost) (gcc 4.3.2) #1 SMP 2008-12-04 18:10:04 +0100 1CPU [SNSfile01.]

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       3962268     1287900     2674368           0      141696      996976
Swap:      4192956           0     4192956

Bootup: Sat Aug 28 08:14:18 2010    Load average: 8.90 7.90 7.65 11/167 23585

user  :      16:48:33.65  30.1%   page in :    608752   disk 1:    12118r  347678w
nice  :       0:00:19.51   0.0%   page out:   7255902   disk 2:    29748r  240080w
system:       0:10:29.42   0.3%   page act:    136932
IOwait:       0:17:25.57   0.5%   page dea:         0
hw irq:       0:00:39.38   0.0%   page flt:  43102070
sw irq:       0:03:29.61   0.1%   swap in :         0
idle  :   1d 14:18:20.75  68.7%   swap out:         0
uptime:   2d  7:47:14.95          context : 118992651

irq  0:        75 timer              irq 12:        92 i8042
irq  1:         8 i8042              irq 14:    175143 ata_piix
irq  3:         1                    irq 15:         0 ata_piix
irq  4:         1                    irq 16:         0 vmci
irq  6:         5 floppy [2]         irq 17:    429639 ioc0
irq  7:         0 parport0           irq 18:  74429101 vmxnet ether
irq  8:         0 rtc0               irq 19:   1158878 vmxnet ether
irq  9:         0 acpi

SNSfile01:/var/log # tail ha-log ha-debug
==> ha-log <==
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5748)
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf57b0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0xddf5a98)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b00)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b68)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5bd0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5c38)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0xddf5ca0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 60 ms (> 10 ms) (GSource: 0xddf5d08)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5d70)

==> ha-debug <==
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5748)
heartbeat[3358]: 2010/08/30_16:09:31 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf57b0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0xddf5a98)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b00)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5b68)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5bd0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5c38)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0xddf5ca0)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 60 ms (> 10 ms) (GSource: 0xddf5d08)
heartbeat[3358]: 2010/08/30_16:09:32 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 70 ms (> 10 ms) (GSource: 0xddf5d70)

SNSfile01:/var/log # zypper info heartbeat
Loading repository data...
Reading installed packages...

Information for package heartbeat:

Repository: @System
Name: heartbeat
Version: 2.99.3-1.6
Arch: i586
Vendor: openSUSE
Installed: Yes
Status: up-to-date
Installed Size: 1.0 M
Summary: The Heartbeat Subsystem for High-Availability Linux
Description:
heartbeat is a sophisticated multinode resource manager for High
Availability clusters.

It can failover arbitrary resources, ranging from IP addresses over NFS
to databases that are tied in via resource scripts. The resources can
have arbitrary dependencies for ordering or placement between them.

heartbeat contains a cluster membership layer, fencing, and local and
clusterwide resource management functionality.

1.2/1.0 based 2-node only configurations are supported in a legacy
mode.

heartbeat implements the following kinds of heartbeats:

- Serial ports
- UDP/IPv4 broadcast, multi-cast, and unicast
- IPv4 "ping" pseudo-cluster members.
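
To put numbers on the dispatch warnings in the logs above, a small shell sketch like the following could tally them per dispatch function. It assumes log lines in exactly the format shown in ha-log; the function name summarize_slow_dispatch is made up for illustration and is not part of heartbeat.

```shell
#!/bin/sh
# Summarize heartbeat "took too long to execute" warnings per dispatch function.
# Reads log lines on stdin, in the format shown in the ha-log excerpt.
# NOTE: summarize_slow_dispatch is a hypothetical helper, not a heartbeat tool.
summarize_slow_dispatch() {
    awk '
        /took too long to execute:/ {
            # The text between "Dispatch function for " and " took too long"
            # identifies which dispatch function was slow.
            name = $0
            sub(/.*Dispatch function for /, "", name)
            sub(/ took too long.*/, "", name)

            # The duration is the number right after "execute: ".
            ms = $0
            sub(/.*execute: /, "", ms)
            sub(/ ms.*/, "", ms)

            count[name]++
            total[name] += ms
            if (ms + 0 > max[name]) max[name] = ms + 0
        }
        END {
            for (n in count)
                printf "%s: count=%d avg_ms=%.1f max_ms=%d\n", n, count[n], total[n] / count[n], max[n]
        }
    '
}
```

It would be run as `summarize_slow_dispatch < /var/log/ha-log`. A count that keeps climbing for the same function (here, "retransmit request") would suggest the process is stuck servicing retransmits rather than doing unrelated work.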
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
