On 10/23/2010 1:06 AM, Simon Hobson wrote:
> Philip Peake wrote:
>
>> The fix I used was one suggested by Alex van den Bogaerdt, which was
>> essentially to insert a NaN to indicate that the counter is now in an
>> unknown state, followed by a zero, so that the next (real) value will be
>> represented correctly.
>>
>> This worked for my tests, so I deployed the fix.
>>
>> Now, I use a DB which actually holds one month (4 weeks) of data, with a
>> 30 second sampling period.
>> I use this DB to display three graphs:
>>
>> Last month
>> Last day
>> Last hour
>>
>> I do this by just setting the start to the appropriate value from <now>.
>>
>> Strangely, I have noticed that this fix doesn't always work.
>>
>> What I see if I look back over the data is a sequence looking like this
>> (simplified, with three data sources):
>>
>> T1  1000  1004   997
>> T2  1010  1020  1003
>> T3   NaN   NaN   NaN
>> T4   NaN   NaN   NaN
>> T5     0     0     0
>> T6     0     0     0
>> T7     0     0     0
>> T8   4E6   4E6   4E6
>> T9    15    12    10
>>
>> No spike is displayed on the month or day graphs, but one is displayed
>> on the hour graph.
>>
>> Two odd things (to me) - Why is rrd still recording a counter roll-over
>> value?
>> Why does the same data show a spike on one graph, but not on the other two?
>>
>> I suppose the third question might be why isn't the roll-over recorded
>> with the first zero rather than the first non-zero?
> I suspect all three questions may be related. There is a distinct but
> small time period where your updates may get out of sync. If an
> update occurs between you writing NaN and zero, then your zero won't
> work and the previous count doesn't get properly reset. In fact,
> depending on the timing, it's entirely possible an update is missing
> because it failed due to "time standing still" (i.e. two updates with
> the same timestamp).
>
> In fact, if you are updating every 30 seconds, there is a 1 in 15
> chance of a clash. Your reset script will take two seconds of time in
> the rrd file to do its work (i.e. update to NaN at time t, update to 0
> at time t + 1 second). Thus two seconds of time (in a 30 second
> window) are not available for your script to update the file.
>
> I'd be inclined to add some logging statements to your scripts to log
> the actual update statements they are using to a text file - that
> way, when you next see the problem occur, you can refer to the text
> file and see what actual updates were done - and replay them into a
> fresh file a step at a time while monitoring the result.
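
The logging suggestion above could be sketched roughly as follows. This is only an illustration, not the script discussed in this thread: the function names, the log path and the 30 second step constant are assumptions made for the example. The only rrdtool detail relied on is the usual update argument form `timestamp:value[:value...]`, with `U` standing for an unknown value.

```python
import time

STEP = 30  # assumed sampling period in seconds, as described in the thread


def aligned_timestamp(now=None, step=STEP):
    """Round a Unix time down to the previous step boundary."""
    if now is None:
        now = time.time()
    return int(now) - int(now) % step


def format_update(timestamp, values):
    """Build an rrdtool update argument; None is written as 'U' (unknown)."""
    fields = ["U" if v is None else str(v) for v in values]
    return ":".join([str(timestamp)] + fields)


def log_update(log_path, rrd_file, update_arg):
    """Append the exact command line to a text log and return it.

    The caller is expected to run the command afterwards; this helper
    (hypothetical) only records what is about to be sent.
    """
    line = f"rrdtool update {rrd_file} {update_arg}"
    with open(log_path, "a") as log:
        log.write(line + "\n")
    return line
```

Each logged line is a complete command, so the log can later be replayed into a fresh rrd file one line at a time while watching the stored values after each step.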
Simon,

The script forces logged data onto 30 second boundaries; I use calculated times, not "now". This applies to the NaN value written when a data source disappears, and to the zero values written every 30 seconds until the source comes back online.

I have dumped the DB values, and see exactly what I expect: increasing values, a NaN (well, I actually got two NaNs), then a string of zeros, followed by a HUGE number (rrd thinks a counter rollover occurred, but only when it sees a non-zero data value????), followed by normal data source readings.

_______________________________________________
rrd-users mailing list
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
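
For anyone trying to reason about the sequence described above, here is a simplified model of how a COUNTER-type data source turns raw readings into per-second rates. It is a sketch, not rrdtool's code, and it rests on three assumptions: an unknown neighbour gives an unknown rate, a decrease is treated as a 32-bit (then 64-bit) wrap, and the raw readings are invented for illustration. In this model, a real reading that follows a run of zeros produces a large rate simply because the difference from zero is large; no wrap is involved. Whether that is what happened in this case cannot be told from the dump alone.

```python
import math

WRAP_32 = 2 ** 32
WRAP_64 = 2 ** 64


def counter_rate(prev, curr, seconds):
    """Simplified per-second rate for a COUNTER-style data source."""
    if prev is None or curr is None:
        return math.nan              # either reading unknown: rate unknown
    delta = curr - prev
    if delta < 0:                    # a decrease is treated as a counter wrap
        delta += WRAP_32
        if delta < 0:                # still negative: assume a 64-bit wrap
            delta += WRAP_64 - WRAP_32
    return delta / seconds


def rates(samples, step=30):
    """Rates between consecutive raw readings taken `step` seconds apart."""
    return [counter_rate(a, b, step) for a, b in zip(samples, samples[1:])]


# Invented raw readings: two real values, one unknown, two zeros,
# then the source returns with its counter still at a high value.
samples = [120_000_000, 120_030_000, None, 0, 0, 120_060_000, 120_060_450]
print(rates(samples))
```

With these made-up readings the zero-to-zero interval gives a rate of 0, and the large value only appears on the first interval that ends in a real reading, which matches the ordering of the zeros and the 4E6 entry in the table quoted earlier.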
