[rrd-users] rrdcached notion of last_update_time out of sync with rrd file

Steve Jankowski Tue, 11 Oct 2011 16:17:01 -0700

I'm encountering situations where rrdcached's notion of the last_update_time is 
out of sync with what's actually in the rrd file.  This is leading to the loss 
of hours worth of data.  I have a rough idea of what triggers the problem, but 
do not yet have steps to reproduce.  I'll describe my usage and hopefully this 
will trigger some suspicions.

Environment: CentOS 5.4, rrdtool 1.4.3 / 1.4.5
(problem encountered with rrdtool 1.4.3. During debugging, I switched to
1.4.5 rrdcached using the journals created by 1.4.3)
Usage: All updates via rrdcached, rrd_fetch against subset of rrds running once
a minute
Size: ~7800 rrds, each updated once a minute. Updates applied within one
second.

My stats processing application sucks up a minute's worth of application stats
each minute, converts them into "update" commands and writes them directly to
rrdcached via unix socket. This app holds a long-lived connection to the
rrdcached. A separate monitoring application scrapes stats from a subset of
the rrds once a minute (using librrd) with the rrdcached daemon address
configured (ie. it sends a flush to rrdcached prior to fetching).

This works great unless the stat processing application dies. The minutely
application stats are buffered up and processed at high-speed when the stats
processing app is restarted. (currently a manual restart, so there could be
hours worth of stats to crunch). While catching up, in excess of 10,000
updates/sec are sent to rrdcached.

The updates appear to accepted by rrdcached (response to each update is "0
errors, enqueued 1 value(s)."). However, a large number of the rrds have no
data for the whole catching up time while some have all the data. Rrd info on
the problem files shows this (even after flush and rrdcached restart):

filename = "acme.rrd"
rrd_version = "0003"
step = 60
last_update = 1318330860
header_size = 8248

However, manually adding the next time results in this error:

update /var/acme/stats/rrd/customer/acme/acme_v1.rrd
1318330920:2741387100176:2744313219528:2344931134016:2321908650392:2366106587368:2384946112584:17:73:753673980704:771235491136:357218014544:348830922000:1743477:142642:36463:213:208:5:156248210:1312447
-1 illegal attempt to update using time 1318330920 when last update time is
1318334820 (minimum one second step)

Why does rrdcached think the last update time is 1318334820 when the file
is 1318330860 ?

Restarting rrdcached has no effect. The only work-around I found was to
shutdown all rrd-using apps, restart rrdcached (so journals get
replayed/flushed), stop rrdcached, delete journals and restart rrdcached. Now
the manual "update" command succeeds.

The only errors logged by rrdcached are of the " illegal attempt to update
using time XXXX" variety.

Theory: Under some condition(s) rrdcached detects (incorrectly?) an error in
an update, but still advances it's internal notion of the last_update_time.
This condition carries forward causing each subsequent update to get an error
and push the last_update_time forward. However, the internal last_update_time
never gets written to the rrd file (which is a good thing, otherwise I would
not be able to recover). This condition persists across journal replays.

If you're still with me, any thoughts?

Steve

_______________________________________________
rrd-users mailing list
[email protected]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users

[rrd-users] rrdcached notion of last_update_time out of sync with rrd file

Reply via email to