Hi list! I asked this on the main mailing list, but nobody seems to know the answer.
I'm using heartbeat on two active/active firewall systems. Within the last 48 hours, coinciding with an uptime of 49 days and a few hours, all servers have suffered the same problem: /var/log/heartbeat.log grows until it fills the /var partition with messages like those in the attached file. From the first message onward, the other node takes over all resources even though the original node is not able to release them; at that point the original node stops working.
Looking at the source code (include/clplumbing/longclock.h), I see that longclock_t is defined as a variable of at least 64 bits, which seems to be enough. But I think that on my servers it is defined as a 32-bit variable:
2^32 = 4294967296 / 1000 (milliseconds to seconds) = 4294967.296 / 3600 (seconds to hours) = 1193.046471111 / 24 (hours to days) = 49.71026963 days, which matches the systems' uptime.
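As a quick check of the arithmetic above, here is a minimal Python sketch. It is only an illustration, not heartbeat code: the function name and the constants are hypothetical. The 1000 ticks per second rate is the one assumed in the calculation above. The 100 ticks per second rate used to interpret the logged clock_t value is an assumption about how times(2) reports ticks on these machines, and nothing in the log confirms it.

```python
# Wrap period of a free-running unsigned tick counter.
SECONDS_PER_DAY = 24 * 3600


def wrap_period_days(bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of the given width overflows."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


# 32-bit counter at 1000 ticks/s (the rate assumed in the calculation above).
days_32 = wrap_period_days(32, 1000)

# 64-bit counter at the same rate, for comparison.
days_64 = wrap_period_days(64, 1000)

# Last "old value" reported by time_longclock in the attached log.
# ASSUMPTION: clock_t is reported at 100 ticks/s; the log does not say so.
LOGGED_OLD_VALUE = 429496648
ASSUMED_CLOCK_TICKS_PER_SECOND = 100
logged_days = LOGGED_OLD_VALUE / ASSUMED_CLOCK_TICKS_PER_SECOND / SECONDS_PER_DAY

print(f"32-bit counter at 1000 ticks/s wraps after {days_32:.4f} days")
print(f"64-bit counter at 1000 ticks/s wraps after {days_64:.3e} days")
print(f"logged old value at 100 ticks/s corresponds to {logged_days:.4f} days")
```

If the assumed tick rates hold, both the 32-bit wrap period and the last logged value come out at roughly 49.7 days, while a 64-bit counter would not wrap in any practical time span.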
What do you think? Is that possible?

Additional information:
Debian version: Sarge (3.1)
Vanilla kernel version: 2.4.34.5
Debian heartbeat version: 2.0.7-2

P.S. Sorry for my poor English skills.

--
Guillem Anguera
Administrador de Sistemas
Jazztel - DATAGRAMA
Tel: 900 80 83 80
Fax: +34 93 289 63 10
[EMAIL PROTECTED]: ganguera () datagrama ! net
http://www.jazztel.es

Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Daily informational memory statistics
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 11/12254511 ms age 0 [pid24261/MST_CONTROL]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 729/386969605 103240/48586 [pid24261/MST_CONTROL]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 327624 total malloc bytes. pid [24261/MST_CONTROL]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/5 ms age -282384916 [pid24264/HBFIFO]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 348/493 43568/21103 [pid24264/HBFIFO]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 45396 total malloc bytes. pid [24264/HBFIFO]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24265/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 350/2593550 43800/21267 [pid24265/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 52740 total malloc bytes. pid [24265/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24266/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 352/9774446 43968/21355 [pid24266/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 44592 total malloc bytes. pid [24266/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24267/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 352/2593560 43968/21355 [pid24267/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 52908 total malloc bytes. pid [24267/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24268/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 352/9774410 43968/21355 [pid24268/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 44592 total malloc bytes. pid [24268/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: These are nothing to worry about.
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650912
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311431
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650913
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311432
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650914
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311433
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311434
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650915
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311435
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
...
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
