I'm encountering situations where rrdcached's notion of the last_update_time is 
out of sync with what's actually in the rrd file.  This is leading to the loss 
of hours worth of data.  I have a rough idea of what triggers the problem, but 
do not yet have steps to reproduce.  I'll describe my usage and hopefully this 
will trigger some suspicions.

Environment: CentOS 5.4, rrdtool 1.4.3 / 1.4.5
   (problem encountered with rrdtool 1.4.3.   During debugging, I switched to 
1.4.5 rrdcached using the journals created by 1.4.3)
Usage: All updates via rrdcached, rrd_fetch against subset of rrds running once 
a minute
Size: ~7800 rrds, each updated once a minute.  Updates applied within one 
second.

My stats processing application sucks up a minute's worth of application stats 
each minute, converts them into "update" commands and writes them directly to 
rrdcached via unix socket.  This app holds a long-lived connection to the 
rrdcached.  A separate monitoring application scrapes stats from a subset of 
the rrds once a minute (using librrd) with the rrdcached daemon address 
configured (ie. it sends a flush to rrdcached prior to fetching).

This works great unless the stat processing application dies.  The minutely 
application stats are buffered up and processed at high-speed when the stats 
processing app is restarted.  (currently a manual restart, so there could be 
hours worth of stats to crunch).  While catching up, in excess of 10,000 
updates/sec are sent to rrdcached.

The updates appear to accepted by rrdcached (response to each update is "0 
errors, enqueued 1 value(s).").  However, a large number of the rrds have no 
data for the whole catching up time while some have all the data.  Rrd info on 
the problem files shows this (even after flush and rrdcached restart):


filename = "acme.rrd"
rrd_version = "0003"
step = 60
last_update = 1318330860
header_size = 8248

However, manually adding the next time results in this error:

update /var/acme/stats/rrd/customer/acme/acme_v1.rrd 
1318330920:2741387100176:2744313219528:2344931134016:2321908650392:2366106587368:2384946112584:17:73:753673980704:771235491136:357218014544:348830922000:1743477:142642:36463:213:208:5:156248210:1312447
-1 illegal attempt to update using time 1318330920 when last update time is 
1318334820 (minimum one second step)

Why does rrdcached think the last update time is 1318334820 when the file 
is 1318330860 ?

Restarting rrdcached has no effect.  The only work-around I found was to 
shutdown all rrd-using apps, restart rrdcached (so journals get 
replayed/flushed), stop rrdcached, delete journals and restart rrdcached.  Now 
the manual "update" command succeeds.

The only errors logged by rrdcached are of the " illegal attempt to update 
using time XXXX" variety.

Theory:  Under some condition(s) rrdcached detects (incorrectly?) an error in 
an update, but still advances it's internal notion of the last_update_time.  
This condition carries forward causing each subsequent update to get an error 
and push the last_update_time forward.  However, the internal last_update_time 
never gets written to the rrd file (which is a good thing, otherwise I would 
not be able to recover).  This condition persists across journal replays.


If you're still with me, any thoughts?


Steve


_______________________________________________
rrd-users mailing list
[email protected]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users

Reply via email to