Hey everyone!

I’d like to get your gut reaction on an idea for the future of alarming. Should 
I or should I not put it up for debate at the design summit?

---TL;DR
Online algorithms for computing stream statistics over sliding windows would 
allow us to provide sample statistics within an error bound (e.g. "The average 
cpu utilization in the last hour was 85% +/- 1%”), while significantly reducing 
the load and memory requirements of the computation.
—

Alarm evaluation currently recalculates the aggregate values each time the 
alarm is evaluated, which is problematic because of the load it puts on the 
system. There have been multiple ideas on how to solve this problem, from 
precalculating aggregate values 
(https://wiki.openstack.org/wiki/Ceilometer/Alerting#Precalculation_of_aggregate_values)
 to re-architecting the alarms into the sample pipeline 
(https://wiki.openstack.org/wiki/Ceilometer/AlarmImprovements). While Sandy's 
suggestions make sense from the performance viewpoint, the problem of 
scalability remains. Samples in the pipeline need to be kept in-memory for the 
whole evaluation window, which requires O(N) memory for a window of size N.

We could tackle this problem by using cutting edge research in streaming 
algorithms, namely the papers by Datar et al. [1], and Arasu et al. [2]. They 
provide algorithms for computing stream statistics over sliding windows, such 
as *count, avg, min, max* and even *percentile*, **online** and with 
polylogarithmic space requirements. The tradeoff is of course precision, but 
the algorithms are bounded on the relative error - which could be specified by 
the user.

If we can tell the user "The average cpu utilization in the last hour was 85% 
+/- 1%", would that not be enough for most use cases, while severely reducing 
the load on the system? We could still support *error_rate=0*, which would 
simply use O(N) space and provide a precise answer for the cases where such an 
answer is needed.

These algorithms were developed with telcos and computer network monitoring in 
mind, "in which information about current network performance—latency, 
bandwidth, etc.—is generated online and is used to monitor and adjust network 
performance dynamically"[1]. IIUC the main user of alarms is Heat autoscaling, 
which is exactly the kind of problem suitable to 'soft' calculations, with a 
certain tolerance for error.

[1] Datar, Mayur, et al. "Maintaining stream statistics over sliding windows." 
*SIAM Journal on Computing* 31.6 (2002): 1794-1813. PDF @ 
http://ilpubs.stanford.edu:8090/504/1/2001-34.pdf

[2] Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles 
over sliding windows." *Proceedings of the twenty-third ACM 
SIGMOD-SIGACT-SIGART symposium on Principles of database systems.* ACM, 2004. 
PDF @ http://ilpubs.stanford.edu:8090/624/1/2003-72.pdf

Attachment: signature.asc
Description: Message signed with OpenPGP using GPGMail

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to