Bert Driehuis wrote:
>
> I've had an unrelated cause for spikes that render the charts of the
> history of my Squid server useless. RRDtool (at least in the way it is
> called by Cricket, but I believe this to be generally true) cannot cope
> with a restart of an SNMP agent. COUNTER objects wind up with huge

<IMO mode=humble>
It doesn't need to, because it is the front-end that should cope with
this. The front-end (Cricket in this case) should detect the restart
and write an unknown to RRDtool.

> values in this scenario:
>
> Time    Value   PDP Value
> t       200     0
> t+1     250     50
> t+2     300     50
> [ agent restarts, counter goes to zero ]
> t+3     5       4.2e9
>

In addition to the above remarks, RRDtool can offer some protection if
you, or the front-end that asks RRDtool to create the database, tell
RRDtool to limit the values that are possible. Clearly, when the delta
(the increase) is normally around 50, 4200000000 is absurd. Read the
tutorial, especially the part about my car. It won't do 4200000000 km/h,
so I set a limit on the database input.

> This can easily be demonstrated by monitoring the value of
> system.sysUpTime.0 for a while (it will show a value of about 100), then
> restarting your snmpd. Squid will expose this behavior more easily than
> a router, as network gear tends to be rarely rebooted (well, Squid
> shouldn't die either, but my test server does every once in a while,
> usually due to pilot error on my side :-)

Indeed, sysUpTime could be monitored. As you may or may not know,
RRDtool does not do any monitoring. It is the front-end that performs
this task.

> It could be an artifact of something else I've done wrong, but I think
> the code in rrd_update.c to deal with overflow is asking for trouble
> anyway. I've attached a diff that replaces that check with an assignment
> of NaN, and unless people object, ask Tobi to include it in the next
> release.

You may have guessed it already: I object. Modifying code that works in
order to mask problems in other code is not done; it is generally
unnecessary and undesirable. I do agree with previous threads on this
list that the code could (perhaps should) be expanded to allow for
arbitrary wrapping values. However, it should handle counter wraps, not
resets.

> Overflows are fairly rare, in my experience.
> If dealing with them is important, code needs to be added to Cricket to
> check to see if system.sysUpTime.0 has decreased since the previous
> sample, and in that case mark the sample with a tag to indicate that it
> is a valid sample, but should not be used for a comparison with the
> previous value. This would be pretty complicated to do right.
>

Why?

    if ((current.sysUptime - previous.sysUptime) < 0)
        feed_U_to_RRDtool;
    feed_counter_to_RRDtool;

RRDtool receives the unknown value, so the current interval is invalid.
Then it receives the correct counter value, and the next interval will
be known. The only thing that needs to be taken care of right now is
the update time: RRDtool cannot handle two samples with the same time
stamp. This may be an improvement for the wish list, but it is also
rather simple to work around by feeding the U value at time NOW-1 and
the current counter value at NOW (this won't allow updates to happen
every second. Who cares?)
</IMO>

There have been discussions on this subject a number of times. You may
want to pay a visit to the archives if you're interested:
http://www.ee.ethz.ch/~slist/rrd-developers

Regards,
-- 
My employer is capable of speaking therefore I speak only for myself
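
P.S. For front-end authors, the sysUpTime check above can be sketched in
Python roughly as follows. This is only a sketch under assumptions:
build_updates and its parameters are hypothetical names, not part of
Cricket or RRDtool, and the "timestamp:value" strings it returns are
meant to be handed to `rrdtool update` by whatever code calls it.

```python
import time


def build_updates(prev_uptime, cur_uptime, counter, now=None):
    """Return the 'timestamp:value' strings to pass to `rrdtool update`.

    prev_uptime and cur_uptime are two consecutive sysUpTime samples,
    counter is the freshly polled COUNTER value. All names here are
    hypothetical and chosen for illustration only.
    """
    if now is None:
        now = int(time.time())

    updates = []
    # sysUpTime going backwards is taken to mean the agent restarted.
    if prev_uptime is not None and cur_uptime < prev_uptime:
        # Feed an unknown one second earlier, because two samples
        # cannot share the same time stamp.
        updates.append(f"{now - 1}:U")
    # The counter itself is always fed, so the next interval is known.
    updates.append(f"{now}:{counter}")
    return updates


if __name__ == "__main__":
    # Agent restarted between polls: uptime dropped from 30000 to 500.
    print(build_updates(30000, 500, 5, now=1000))  # ['999:U', '1000:5']
```

Note that sysUpTime is itself a 32-bit count of hundredths of a second,
so it wraps after roughly 497 days. This sketch would treat that wrap as
a restart, at the cost of one unknown interval.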
