I've had this happen too often now for it to be a fluke. I'm using the trunk
version of rrdtool 1.4, but (as far as I know) nothing in there modifies the
update code. We have a high update frequency: approx. 20,000 MRTG targets at
5-minute intervals, which equates to about 70 updates per second. It took
about a week for the problem to first hit.
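As a quick sanity check on that rate, here is the arithmetic in Python. It assumes exactly one RRD update per target per polling interval, which is a simplification.

```python
# Rough update-rate estimate: one RRD update per MRTG target per interval.
targets = 20_000        # approximate number of MRTG targets
interval_s = 5 * 60     # 5-minute polling interval, in seconds

updates_per_second = targets / interval_s
print(f"{updates_per_second:.1f} updates/s")  # prints "66.7 updates/s"
```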
It seems that something is happening on update, possibly involving memory
allocation failure, that results in a corrupted file.
I have some processes that may be reading the files without going through
rrdcached, but all updates certainly go through it (no data collection runs on
this server any more; it all comes in over TCP).
Selected error logs show:
listen_thread_main: pthread_create failed.
queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. (mmaping file '/u01/rrdtool/maildelivery-mx1.rrd': Cannot allocate memory)
(restarted rrdcached here)
replaying from journal: /u01/rrdtool/journal/rrd.journal.1285603416.766523
Replayed 61011 entries (0 failures)
replaying from journal: /u01/rrdtool/journal/rrd.journal.1285607016.766153
Malformed journal entry at line 31024
Replayed 31023 entries (1 failures)
journal processing complete
queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. ('/u01/rrdtool/maildelivery-mx1.rrd' is not an RRD file)
Although there was only one journal failure, several RRD files were in fact
corrupted (I suspect the ones that were open at the time of the memory
failure), and even more files hit the rrd_update_r memory allocation failure.
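To get a list of the files affected, something like the following rough Python sketch could be used. It assumes that a valid RRD file begins with the 4-byte cookie "RRD" followed by a NUL byte, and that the "is not an RRD file" error corresponds to that check failing. The function names are made up for illustration. A file that passes is not necessarily healthy, since only the first four bytes are examined.

```python
from pathlib import Path

# Assumption: a valid RRD file starts with the bytes "RRD\0".
RRD_COOKIE = b"RRD\x00"


def has_rrd_cookie(path):
    """Return True if the file at `path` starts with the RRD cookie."""
    try:
        with open(path, "rb") as fh:
            return fh.read(len(RRD_COOKIE)) == RRD_COOKIE
    except OSError:
        # Unreadable files are reported as suspect rather than skipped.
        return False


def find_suspect_rrds(root):
    """Return a sorted list of *.rrd files under `root` failing the check."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*.rrd")
        if p.is_file() and not has_rrd_cookie(p)
    )
```

For example, `find_suspect_rrds("/u01/rrdtool")` would return the paths under that directory whose first bytes do not match.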
It seems that memory ran out (a memory leak?) and something in rrd_update_r
was left half-done. The resulting corrupted RRD file doesn't even load in
rrdtool; the header seems to be corrupt. I don't (yet) understand enough of the
mmap code to work out what could be causing this. I'm also tracking the memory
usage of the rrdcached process to see whether it is indeed growing due to a
leak.
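For the memory tracking, a minimal sketch along these lines might do, assuming a Linux host where /proc/<pid>/status is available. The function names are hypothetical, and the pid of rrdcached has to be supplied by the caller. It records VmSize as well as VmRSS, since an mmap failure can come from address-space limits and not only from physical memory.

```python
import re


def parse_vm_kb(status_text):
    """Extract VmRSS and VmSize (in kB) from /proc/<pid>/status text."""
    fields = {}
    for name in ("VmRSS", "VmSize"):
        match = re.search(rf"^{name}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
        if match:
            fields[name] = int(match.group(1))
    return fields


def read_vm_kb(pid):
    """Read the current memory figures for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vm_kb(fh.read())
```

Calling `read_vm_kb(pid)` from cron every few minutes and logging the result with a timestamp should show whether either figure climbs steadily between restarts.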
I think there are two bugs here: first, the memory leak causing the failure,
and second, code that does not correctly handle a memory allocation failure
and corrupts the RRD file as a result.
Has anyone else experienced this? And, more to the point, would any RRD
developers who understand the mmap update code be willing to take a look or
give some pointers?
Steve
________________________________
Steve Shipway
ITS Unix Services Design Lead
University of Auckland, New Zealand
Floor 1, 58 Symonds Street, Auckland
Phone: +64 (0)9 3737599 ext 86487
DDI: +64 (0)9 924 6487
Mobile: +64 (0)21 753 189
Email: [email protected]