RRD Developers and Tobi Oetiker,

As you may be aware, at WebTV we make extensive use of RRDtool and Cricket. We use them not only for real-time monitoring of network hardware such as router interfaces and switch ports, but also for real-time monitoring of software applications and processes running on a large number of Solaris hosts.

We have long debated the best approach to incorporating real-time aberrant behavior detection into the monitoring system. Cricket implements thresholding; that is, it generates alerts (via email or SNMP traps) when a time series for which it is enabled exceeds absolute bounds set in the Cricket configuration files. This mechanism is simple and effective for detecting some kinds of aberrant behavior. However, there is a need for a more sophisticated algorithm for aberrant behavior detection. Unfortunately, there is no uniformly best choice, but there are desirable characteristics we can select for:

(1) Provides near real-time detection for the monitoring application.
(2) Adapts in real time as the time series evolves.
(3) Low computation and disk overhead.
(4) Easy to understand and tune.

Given these goals, and our reliance on RRDtool and Cricket, we are proceeding to implement such an algorithm in RRDtool. While such functionality could be implemented in a separate standalone application, the primary motivation for adding it to RRDtool is efficiency. At WebTV, we are acutely aware that a small inefficiency, perhaps imperceptible at the single-process level, can result in a significant performance impediment as the number of processes scales up. A further advantage is that we can leverage the graphing capabilities already included in RRDtool.

We would prefer that these enhancements to RRDtool, once completed and in service here at WebTV, be incorporated into the public distribution of RRDtool. That, of course, is a decision to be made by the RRD community. There are a number of questions to be asked:

(1) What are the plans for the next big version of RRD? I know that smoothing algorithms have already been proposed. The aberrant behavior detection algorithm does provide a smoothing algorithm as a subset.
(2) Should aberrant behavior detection be available in RRD?
I think most network administrators agree the functionality is desirable; the question is whether it should be a part of RRD. This is the longstanding trade-off between modular code and efficient code.
(3) If so, what algorithms should be used? I will freely admit that what we are implementing at WebTV is not an optimal algorithm. However, many algorithms are inappropriate for real-time monitoring, or are far too complicated for a network technician to tune without a PhD consultant looking over his shoulder.

A draft description of our implementation (already underway) is at http://cricket.sourceforge.net/aberrant/rrd_hw.htm. The document is primarily a discussion of the implementation, not of the aberrant behavior detection algorithm. This implementation touches many of the core C files of RRD. At the same time, the RRD file structure on disk is unchanged. The enhanced tool will run with existing RRD files. This backwards compatibility is essential, because we know our aberrant behavior detection algorithm is only appropriate for a subset of time series. In some cases simple thresholding (as Cricket provides) is sufficient. In others, the processing cost of aberrant behavior detection is too high relative to the potential benefit.

I invite your comments on this project. As with any of our modifications to RRD, we want to share them with the RRD community as a patch. We are not ready to do so yet, but plan to do so by the end of July. I thought it best to start a discussion on this topic sooner rather than later.

Sincerely,

Jake Brutlag
Network Analyst
Microsoft WebTV
