Re: [rrd-users] [rrd-developers] rrdcached use corrupting RRD files (trunk)

Thorsten von Eicken Wed, 03 Nov 2010 10:53:29 -0700

Sadly interesting...
As a separate data point, we're running over 100 rrdcached servers, each handling >30k tree nodes and receiving about 3k updates/sec. Data is cached for ~1 hour, so files are updated at ~20 updates/sec. Uptimes are in months without problems, and we have never seen corruption (knock on wood). We're running 1.4 trunk revision r2092 (randomly picked) on Ubuntu 8.04 (we used to run on CentOS 5.2, I believe). We're not seeing any memory leak and run stable at 800-900MB virtual / 500-600MB RSS. We're using TCP sockets and doing updates, fetches and flushes. The command line we use is:
/usr/bin/rrdcached -w 3600 -z 3600 -f 7200 -t 2 -a 128 -b /rrds/hosts -B -j /rrds/journal -p /var/run/rrdcached/rrdcached.pid -l 10.x.x.x:xxxx
I'm not writing this to contradict you, I'm just wondering what could be different in your set-up that causes the problems. (Oh, that reminds me that the -a 128 made a huge difference for us around memory allocation performance.)
Good luck!
TvE

On 10/21/2010 6:50 PM, Steve Shipway wrote:

The corrupted file ends up the correct size; however, the entire file is filled with zeroes. (Fortunately, we archive our RRD files nightly, so I can go back and retrieve the last uncorrupted version plus the corrupted version.)

The system is not (normally) memory or process-constrained; there is in fact nothing to speak of running apart from apache and the rrdcached daemon. The rrdinfo response is ‘not an RRD file’, since it doesn’t have the RRD header.
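The two symptoms described above (a file of the right size that is all zeroes, and a missing RRD header) can be checked for in bulk. The following is a minimal sketch, not part of rrdtool: it assumes that a valid RRD file begins with the 4-byte cookie "RRD" followed by a NUL byte, and the function name `classify_rrd` and its return labels are made up for illustration.

```python
RRD_COOKIE = b"RRD\x00"  # assumed header cookie at the start of an RRD file


def classify_rrd(path, chunk_size=1 << 20):
    """Return 'ok', 'empty', 'zero-filled' or 'bad-header' for the file at path.

    'ok' only means the cookie is present; the rest of the file is not checked.
    """
    with open(path, "rb") as fh:
        head = fh.read(len(RRD_COOKIE))
        if not head:
            return "empty"
        if head == RRD_COOKIE:
            return "ok"
        # No cookie: tell an all-zero file apart from other kinds of damage.
        fh.seek(0)
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return "zero-filled"
            if chunk.count(0) != len(chunk):
                return "bad-header"
```

Running this over the RRD directory after an incident would give a list of zero-filled files to restore from the nightly archive.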

It has run fine for a whole week at these rates before the problem hit; so that’s why I think it might be a leak in the RRD functions (which would of course not show up in a non-daemon situation). We use the remote update, info and (occasionally) create via the TCP socket; plus the info, last, flush and fetch via the UNIX socket.

The build is the absolute latest, r2136.

The memory usage of the rrdcached process is definitely increasing; however, that may also be due to the number of items in the queue. It is currently at 768m virtual, 560m physical (17% usage), which seems somewhat high to me, even for 20,000+ RRD files. Eventually it will hit address-space limits (this is a 32-bit RHEL5 box with 4G physical memory).
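One way to tell queue growth apart from a leak is to record the daemon's own statistics alongside its memory usage. The sketch below assumes the rrdcached text protocol as I understand it: a STATS command answered by a status line whose leading number is the count of "Name: value" lines that follow, with a negative number signalling an error. The helper names `parse_stats` and `fetch_stats` are hypothetical, and the host and port are whatever was given to -l.

```python
import socket


def parse_stats(reply):
    """Parse a STATS reply (status line plus 'Name: value' lines) into a dict."""
    lines = reply.splitlines()
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        raise RuntimeError(f"rrdcached error: {message}")
    stats = {}
    for line in lines[1:1 + count]:
        key, _, value = line.partition(":")
        value = value.strip()
        try:
            stats[key.strip()] = int(value)
        except ValueError:
            stats[key.strip()] = value  # keep non-numeric values as text
    return stats


def fetch_stats(host, port, timeout=5.0):
    """Send STATS over TCP and return the parsed reply (assumed protocol)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"STATS\n")
        reader = sock.makefile("r", encoding="ascii", newline="\n")
        first = reader.readline()
        count = int(first.split(" ", 1)[0])
        body = [reader.readline() for _ in range(max(count, 0))]
    return parse_stats(first + "".join(body))
```

If QueueLength and TreeNodesNumber stay flat while the process size keeps climbing, the queue is probably not the explanation.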

Unfortunately I don’t have any of the nice developer tools for tracking memory leaks…

Steve

Steve Shipway
ITS Unix Services Design Lead
University of Auckland, New Zealand
Floor 1, 58 Symonds Street, Auckland
Phone: +64 (0)9 3737599 ext 86487
DDI: +64 (0)9 924 6487
Mobile: +64 (0)21 753 189
Email: [email protected]


From: kevin brintnall [mailto:[email protected]]
Sent: Friday, 22 October 2010 1:40 p.m.
To: Steve Shipway
Cc: [email protected]; [email protected]
Subject: Re: [rrd-developers] rrdcached use corrupting RRD files (trunk)

Sebastian,

I don't think the problem is specific to rrdcached; it uses the normal librrd API. This problem likely affects any RRD access on a memory-constrained system.

Is there a lack of memory (or address space if 32-bit) on the system? Or is it running up against per-process limits?

How does the file end up? Is it the right size? What errors do you get (e.g. when you run "rrdtool info")? What architecture are you running on? mmap() under failure conditions is likely to be OS-specific.

What revision of trunk?

Let us know what you find re: memory leak.

-kb

On Thu, Oct 21, 2010 at 5:07 PM, Steve Shipway <[email protected]> wrote:

I’ve had this happen too often now for it to be a fluke. OK, so I’m using the trunk version of rrdtool 1.4, but (as far as I know) there is nothing in there to modify the update code. We have a high update frequency – approx. 20,000 MRTG targets at 5min intervals, which equates to about 70 updates per second, and it took about a week for the problem to first hit.
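The update rate quoted above follows directly from the target count and polling interval. As a quick check, with a hypothetical helper name:

```python
def updates_per_second(targets, interval_seconds):
    """Average update rate when every target is updated once per interval."""
    return targets / interval_seconds


# 20,000 MRTG targets polled every 5 minutes
rate = updates_per_second(20_000, 5 * 60)
print(round(rate, 1))  # 66.7, i.e. roughly 70 updates per second
```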

It seems that something is happening on update, possibly involving memory allocation failure, that results in a corrupted file.

I have some processes that may be reading the file without going through rrdcached, but all updates are certainly going this way (no data collection is run on this server any more; it all comes over TCP).

Selected error logs show:

listen_thread_main: pthread_create failed.
queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. (mmaping file '/u01/rrdtool/maildelivery-mx1.rrd': Cannot allocate memory)

(restarted rrdcached here)

replaying from journal: /u01/rrdtool/journal/rrd.journal.1285603416.766523
Replayed 61011 entries (0 failures)
replaying from journal: /u01/rrdtool/journal/rrd.journal.1285607016.766153
Malformed journal entry at line 31024
Replayed 31023 entries (1 failures)
journal processing complete
queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. ('/u01/rrdtool/maildelivery-mx1.rrd' is not an RRD file)

Although there was only one journal failure, there were in fact several RRD files corrupted (I suspect the ones which were open at the time of the memory failure?) and even more with the rrd_update_r memory allocation failure.

It seems that memory ran out (memory leak?) and something in rrd_update_r was left half-done. The resulting corrupted RRD file doesn’t even load in rrdtool; it seems the header is corrupt. I don’t (yet) understand enough of the mmap code to work out what could be causing this. I’m also trying to track the memory usage of the rrdcached process to see if it is indeed growing due to a leak.
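Tracking the process size over time does not need special tooling on Linux. A minimal sketch, assuming the standard /proc/<pid>/status layout where VmSize and VmRSS are reported in kB; the function names are hypothetical and the PID would come from wherever the daemon writes its pid file.

```python
def parse_proc_status(text, fields=("VmSize", "VmRSS")):
    """Extract the requested memory fields (in kB) from /proc/<pid>/status text."""
    usage = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in fields:
            usage[key] = int(rest.split()[0])  # first token is the size in kB
    return usage


def sample_memory(pid):
    """Read the current virtual and resident size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_status(fh.read())
```

Logging `sample_memory(pid)` with a timestamp every few minutes and plotting it over a week would show whether growth is steady (suggesting a leak) or levels off once the cache is full.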

I think there are two bugs here – first, the memory leak causing the failure, and second, something in the code is not correctly handling a memory allocation failure and corrupts the RRD file as a result.

Has anyone else experienced this? And, more to the point, any RRD developers who understand the MMAP update code want to take a look or give some pointers?

Steve


_______________________________________________
rrd-developers mailing list
[email protected]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers

--
kevin brintnall =~ /[email protected]/

_______________________________________________
rrd-users mailing list
[email protected]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
