Hi list!
I'm running heartbeat on two active/active firewall systems. Within the
last 48 hours, coinciding with an uptime of 49 days and a few hours, all
servers have suffered the same problem: /var/log/heartbeat.log grows
until it fills the /var partition with messages like those in the
attached file.
From the first such message, the other node takes over all resources
even though the original node is unable to release them; from that point
on, the original node no longer works.
I looked at the source code (include/clplumbing/longclock.h) and saw that
longclock_t is defined to be at least 64 bits wide, which should be
enough. But I suspect that on my servers it ends up as a 32-bit variable,
because a 32-bit millisecond counter wraps after:
2^32 ms = 4294967296 ms
/ 1000 (milliseconds to seconds) = 4294967.296 s
/ 3600 (seconds to hours) = 1193.046471111 h
/ 24 (hours to days) = 49.71026963 days
which matches the systems' uptime.
What do you think? Is that possible?
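The wrap-time estimate above can be reproduced with a short script. This is only a sketch of the arithmetic: the function name days_until_wrap is made up for illustration, and the tick rate of 1000 per second is the assumption made in this message, not something read from the heartbeat source.

```python
def days_until_wrap(bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of `bits` bits wraps around,
    assuming it advances `ticks_per_second` times per second."""
    seconds = (2 ** bits) / ticks_per_second
    return seconds / 3600 / 24


# 32-bit counter of milliseconds (the assumption in this message).
print(round(days_until_wrap(32, 1000), 2))  # 49.71

# 64-bit counter at the same rate, for comparison.
print(days_until_wrap(64, 1000))
```

At the same tick rate, a 64-bit counter would take hundreds of millions of years to wrap, so a wrap after about 49.7 days would point to a 32-bit quantity somewhere in the chain.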
Additional Information:
Debian version: Sarge (3.1)
Vanilla kernel version: 2.4.34.5
Debian heartbeat version: 2.0.7-2
P.S.: Sorry for my poor English.
--
Guillem Anguera
Systems Administrator
Jazztel - DATAGRAMA
Tel: 900 80 83 80
Fax: +34 93 289 63 10
[EMAIL PROTECTED]: ganguera () datagrama ! net
http://www.jazztel.es
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Daily informational memory statistics
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 11/12254511 ms age 0 [pid24261/MST_CONTROL]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 729/386969605 103240/48586 [pid24261/MST_CONTROL]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 327624 total malloc bytes. pid [24261/MST_CONTROL]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/5 ms age -282384916 [pid24264/HBFIFO]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 348/493 43568/21103 [pid24264/HBFIFO]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 45396 total malloc bytes. pid [24264/HBFIFO]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24265/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 350/2593550 43800/21267 [pid24265/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 52740 total malloc bytes. pid [24265/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24266/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 352/9774446 43968/21355 [pid24266/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 44592 total malloc bytes. pid [24266/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24267/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 352/2593560 43968/21355 [pid24267/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 52908 total malloc bytes. pid [24267/HBWRITE]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: MSG stats: 0/0 ms age -18800606 [pid24268/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: ha_malloc stats: 352/9774410 43968/21355 [pid24268/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: RealMalloc stats: 44592 total malloc bytes. pid [24268/HBREAD]
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: Current arena value: 0
Mar 25 11:09:41 fw02 heartbeat: [24261]: info: These are nothing to worry about.
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650912
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311431
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650913
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311432
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650914
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311433
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311434
Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 16650915
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: old value was 429496648, new value is 19, diff is 429496629, callcount 33311435
Mar 25 16:23:00 fw02 logd: [23999]: CRIT: time_longclock: clock_t from times(2) appears to have jumped backwards (in error)!
...
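The numbers in the CRIT lines above can be checked mechanically. The sketch below parses one of those lines (copied verbatim from the log) and confirms that the logged diff is simply the old value minus the new one. It also prints how far into a 32-bit range the old value was. The regular expression and the variable names are illustrative assumptions, not part of heartbeat.

```python
import re

# One CRIT line copied from the log above.
line = ("Mar 25 16:23:00 fw02 logd: [24003]: CRIT: time_longclock: "
        "old value was 429496648, new value is 19, diff is 429496629, "
        "callcount 16650912")

# Pull the three numbers out of the message text.
m = re.search(r"old value was (\d+), new value is (\d+), diff is (\d+)", line)
old, new, diff = (int(g) for g in m.groups())

# The logged diff is the old value minus the new value.
print(old - new == diff)  # True

# Fraction of the full 32-bit range that the old value had reached.
print(old / 2 ** 32)
```

The old value is only about one tenth of 2^32, so in its own units the counter restarted near zero well before reaching the full 32-bit range. One possible reading, which this log alone cannot confirm, is that the value is derived from a 32-bit counter that ticks ten times faster than the reported unit.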
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems