On Thu, Oct 21, 2010 at 8:50 PM, Steve Shipway <[email protected]> wrote:
> The corrupted file ends up the correct size; however the entire file is
> filled with zeroes (fortunately, we archive our RRD files nightly so I can
> go back and retrieve the last uncorrupted version plus the corrupted
> version)

Strange... The failure mode for rrd_open() unmaps and closes the file...
that's about it. I'm not sure how it could zero the file like that.

> The system is not (normally) memory or process-constrained; there is in
> fact nothing to speak of running apart from apache and the rrdcached
> daemon. The rrdinfo response is ‘not an RRD file’, since it doesn’t have
> the RRD header.
>
> It has run fine for a whole week at these rates before the problem hit; so
> that’s why I think it might be a leak in the RRD functions (which would of
> course not show up in a non-daemon situation). We use the remote update,
> info and (occasionally) create via the TCP socket; plus the info, last,
> flush and fetch via the UNIX socket.

My workload is all UPDATE and FLUSH and I'm not seeing any problems. It's
possible that the newer code (info, create) has a leak that I haven't caught
yet in production. Could you show me:

 - the output of 'stats' from your daemon
 - "rrdtool info" from an RRD that's typical of your workload
 - the args you're using when starting the rrdcached daemon

> The build is the absolute latest r2136.
>
> The memory usage of the rrdcached process is definitely increasing; however
> that may also be due to the number of items in the queue? It is currently
> at 768m virtual, 560m physical (17% usage) which seems somewhat high to me,
> even for 20,000+ RRD files. Eventually it will hit address-space limits
> (this is a 32bit RHEL5 box with 4G physical memory)

My rrdcached runs around 2GB. That's with about 350k RRDs and 72 cached
values per RRD. So, your memory utilization does look high.
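
As a rough back-of-the-envelope check of that comparison, here is a small sketch. It assumes the "2GB" and "768m" figures are comparable measures of process size and are in binary units, and it treats "20,000+" as exactly 20,000 files, so the second figure is an upper bound. The function name is made up for illustration.

```python
# Rough per-RRD memory comparison using only the figures quoted in this thread.
# Assumptions: "2GB" and "768m" are comparable process-size numbers in binary
# units, and "20,000+" is taken as exactly 20,000 files.
GIB = 1024 ** 3
MIB = 1024 ** 2


def bytes_per_rrd(total_bytes, rrd_count):
    """Average process memory attributable to one RRD file."""
    return total_bytes / rrd_count


reference = bytes_per_rrd(2 * GIB, 350_000)    # ~350k RRDs, 72 cached values each
reported = bytes_per_rrd(768 * MIB, 20_000)    # 768m virtual, 20,000 RRDs
ratio = reported / reference

print(f"reference daemon: {reference / 1024:.1f} KiB per RRD")
print(f"reported daemon:  {reported / 1024:.1f} KiB per RRD")
print(f"ratio:            {ratio:.1f}x")
```

Under those assumptions the reported daemon uses several times more memory per file than the reference one. Using the 560m physical figure instead of the virtual size narrows the gap but does not close it.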

> Unfortunately I don’t have any of the nice developer tools for tracking
> memory leaks…

You could install "valgrind" and run the daemon under that for a while. The
daemon should be compiled with debugging symbols (-g) and not stripped in
this case, i.e.

  % valgrind --leak-check=full --show-reachable=yes rrdcached -args blah blah blah

Then, on exit it will show you what's leaking.

Alternatively, if you can make a script that typifies your workload (perhaps
at a smaller scale) that would help to reproduce the problem.

-kb

> Steve
>
> ------------------------------
> *Steve Shipway*
> ITS Unix Services Design Lead
> University of Auckland, New Zealand
> Floor 1, 58 Symonds Street, Auckland
> *Phone: +64 (0)9 3737599 ext 86487*
> *DDI: +64 (0)9 924 6487*
> *Mobile: +64 (0)21 753 189*
> *Email: [email protected]*
>
> *From:* kevin brintnall [mailto:[email protected]]
> *Sent:* Friday, 22 October 2010 1:40 p.m.
> *To:* Steve Shipway
> *Cc:* [email protected]; [email protected]
> *Subject:* Re: [rrd-developers] rrdcached use corrupting RRD files (trunk)
>
> Sebastian,
>
> I don't think the problem is specific to rrdcached; it uses normal librrd
> API. This problem likely affects any RRD access in a memory constrained
> system.
>
> Is there a lack of memory (or address space if 32-bit) on the system? Or
> is it running up against per-process limits?
>
> How does the file end up? Is it the right size? What errors do you get
> (i.e. when you "rrdtool info"). What architecture are you running on?
> mmap() under failure conditions is likely to be OS-specific.
>
> What revision of trunk?
>
> Let us know what you find re: memory leak.
>
> -kb
>
> On Thu, Oct 21, 2010 at 5:07 PM, Steve Shipway <[email protected]> wrote:
>
> I’ve had this happen too often now for it to be a fluke.
> OK, so I’m using the trunk version of rrdtool 1.4, but (as far as I know)
> there is nothing in there to modify the update code. We have a high update
> frequency – approx. 20,000 MRTG targets at 5min intervals, which equates to
> about 70 updates per second, and it took about a week for the problem to
> first hit.
>
> It seems that something is happening on update, possibly involving memory
> allocation failure, that results in a corrupted file.
>
> I have some processes that may be reading the file without using the
> rrdcached, but all updates are certainly going this way (no data collection
> is run on this server any more, it all comes over TCP)
>
> Selected error logs show:
>
> listen_thread_main: pthread_create failed.
> queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. (mmaping file '/u01/rrdtool/maildelivery-mx1.rrd': Cannot allocate memory)
> *(restarted rrdcached here)*
> replaying from journal: /u01/rrdtool/journal/rrd.journal.1285603416.766523
> Replayed 61011 entries (0 failures)
> replaying from journal: /u01/rrdtool/journal/rrd.journal.1285607016.766153
> Malformed journal entry at line 31024
> Replayed 31023 entries (1 failures)
> journal processing complete
> queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. ('/u01/rrdtool/maildelivery-mx1.rrd' is not an RRD file)
>
> Although there was only one journal failure, there were in fact several RRD
> files corrupted (I suspect the ones which were open at the time of the
> memory failure?) and even more with the rrd_update_r memory allocation
> failure.
>
> It seems that the memory ran out (memory leak?) and somewhere in the
> rrd_update_r something was half-done. The resultant corrupted RRD file
> doesn’t even load in rrdtool, seems the header is corrupt – I don’t (yet)
> understand enough of the mmap code to work out what could be causing this.
> I’m also trying to track the memory usage of the rrdcached process to see if
> it is indeed growing due to a leak.
>
> I think there are two bugs here – first, the memory leak causing the
> failure, and second, something in the code is not correctly handling a
> memory allocation failure and corrupts the RRD file as a result.
>
> Has anyone else experienced this? And, more to the point, any RRD
> developers who understand the MMAP update code want to take a look or give
> some pointers?
>
> Steve
>
> _______________________________________________
> rrd-developers mailing list
> [email protected]
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers

--
 kevin brintnall =~ /[email protected]/
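
For the workload script suggested above, something along these lines might be a starting point. It is only a sketch: it speaks the plain-text rrdcached protocol (UPDATE and FLUSH) over the UNIX socket, and the function names, the two-value update payload and the flush interval are all made up for illustration and would need adjusting to match the real RRDs.

```python
import socket


def format_update(rrd_path, timestamp, values):
    """Build one UPDATE command line: 'UPDATE <file> <time>:<v1>:<v2>...'."""
    fields = ":".join(str(v) for v in values)
    return f"UPDATE {rrd_path} {timestamp}:{fields}\n"


def format_flush(rrd_path):
    """Build one FLUSH command line for a single RRD file."""
    return f"FLUSH {rrd_path}\n"


def parse_status(line):
    """Split a response status line into (code, message).

    A negative code means an error; a non-negative code is the number of
    additional response lines that follow.
    """
    code, _, message = line.strip().partition(" ")
    return int(code), message


def send_command(stream, command):
    """Send one command and read back its complete response."""
    stream.write(command.encode("ascii"))
    stream.flush()
    code, message = parse_status(stream.readline().decode("ascii"))
    extra = [stream.readline().decode("ascii").rstrip("\n")
             for _ in range(max(code, 0))]
    return code, message, extra


def run_workload(socket_path, rrd_paths, start_time, step, rounds, flush_every=10):
    """Send `rounds` updates to every RRD, flushing all of them periodically.

    Hypothetical payload: two counter values per update (MRTG-style in/out).
    Returns a list of (path, error message) for every failed command.
    """
    failures = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        stream = sock.makefile("rwb")
        for n in range(rounds):
            timestamp = start_time + n * step
            for path in rrd_paths:
                code, message, _ = send_command(
                    stream, format_update(path, timestamp, [n, n]))
                if code < 0:
                    failures.append((path, message))
            if (n + 1) % flush_every == 0:
                for path in rrd_paths:
                    code, message, _ = send_command(stream, format_flush(path))
                    if code < 0:
                        failures.append((path, message))
    return failures
```

Calling run_workload() with the daemon's socket path and a list of scratch RRD files (created beforehand with a matching step and two data sources) while the daemon runs under valgrind should exercise the UPDATE and FLUSH paths. The info and create commands over TCP would need to be added separately to cover the rest of the reported usage.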
_______________________________________________
rrd-users mailing list
[email protected]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
