Hello everyone, I'm embarking on some monitoring enhancements and I'm 
wanting to make it really easy for my colleagues (who don't know 
Prometheus) to write simple tests and have test failures show up as a 
Prometheus alert.

The fundamental business need is to make it really easy for any admin to add 
high-signal-low-noise alerts, particularly as a result of incident response.

My initial idea looks something like this:

I have a directory, similar in spirit to /etc/cron.d/ or /etc/cron.hourly/, 
containing scripts that each test a particular aspect of the 
system/application. Eg. you might have a simple bash or Python script that 
does some test and exits with a status code (eg. an exit code of 0 might 
indicate 'assertion-passed', and non-0 might indicate 'assertion-failed'). 
The name of the test might be taken from the filename of the script.

A test-runner would run all of these scripts and generate appropriate 
metrics for consumption via the textfile collector.
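
To make the test-runner idea concrete, here is a minimal sketch in Python. It assumes each executable file in the directory is one test, that an exit code of 0 means the assertion passed, and that a timeout or a failure to start counts as a failure. The function name run_assertions and the timeout_seconds parameter are made up for illustration. The executable-bit check is a POSIX convention, so on Windows the selection rule would need to be something else (eg. file extension).

```python
import os
import subprocess
from pathlib import Path


def run_assertions(directory, timeout_seconds=10.0):
    """Run each executable file in `directory`.

    Returns a dict mapping test name (filename without extension)
    to 1 (assertion passed) or 0 (assertion failed).
    """
    results = {}
    for script in sorted(Path(directory).iterdir()):
        # Skip subdirectories and anything not marked executable.
        if not script.is_file() or not os.access(script, os.X_OK):
            continue
        try:
            completed = subprocess.run(
                [str(script)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_seconds,
            )
            passed = completed.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # Assumption: a hung or unstartable test is treated as a failure.
            passed = False
        results[script.stem] = 1 if passed else 0
    return results
```

Treating a timeout as a failed assertion is only one possible choice; a separate metric for "test did not complete" may turn out to be more useful when alerting.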

Playing with what this might look like in terms of metrics, and thinking 
about instances where this would have been useful in the past, it might look 
like:

# HELP assertion Assertion is passing (1) or failing (0)
# TYPE assertion gauge
assertion{test_name="apache_httpd_configtest_okay"} 1
assertion{test_name="drbd_synced"} 1
assertion{test_name="transfer_queue_not_stuck"} 0
assertion{test_name="can_reach_ldap_server"} 1
assertion{test_name="connected_to_accounting_service"} 1
assertion{test_name="federation_metadata_current"} 1
assertion{test_name="login_test"} 1
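
As a rough sketch of how such output could be produced, the Python below renders a dict of results in the Prometheus text exposition format and writes it to a file. The function names render_metrics and write_textfile are made up for illustration. It assumes the textfile collector only reads files ending in .prom, so the temporary file uses a different suffix and is renamed into place to avoid the collector seeing a half-written file.

```python
import os
import tempfile


def escape_label_value(value):
    """Escape backslash, double quote and newline for a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(results):
    """Render {test_name: 0 or 1} in the text exposition format."""
    lines = [
        "# HELP assertion Assertion is passing (1) or failing (0)",
        "# TYPE assertion gauge",
    ]
    for name in sorted(results):
        lines.append(
            'assertion{test_name="%s"} %d'
            % (escape_label_value(name), results[name])
        )
    return "\n".join(lines) + "\n"


def write_textfile(results, path):
    """Write the metrics to `path` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metrics(results))
    # mkstemp creates the file readable by its owner only; the exporter
    # may well run as a different user, so widen the permissions.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```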

Before I go further down this path, I'm wanting to know if others have done 
something similar and to survey what works and what doesn't, so I don't 
take my group down a wrong path. After all, it doesn't take much playing to 
determine follow-on requirements such as:

* I want this to be as easy as dropping in some logic (tests) in a 
manner befitting how the server/service is deployed (eg. 
manually, via Ansible, etc.). This must not require the Prometheus 
team to be the ones creating the tests; they would only raise alerts in 
response to assertion failures.
* I need to have some tests run much more frequently than others (like 
unit-tests are to integration-tests)
* Some assertions will be warnings, others will be more critical
* Process supervision must be present to handle process timeout etc.
* If we migrate from Prometheus to something else (or a later major version 
of Prometheus), I want this collateral to still be useful, so I need a 
decent abstraction interface.
* Oh, and this must work on Linux systems as well as Windows.
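
One way the frequency and severity requirements above might be handled without tying the tests to Prometheus is to encode both in the directory layout and keep them in a plain record. The sketch below assumes a purely hypothetical layout of <root>/<interval in seconds>/<severity>/<script>, eg. assertions.d/60/critical/drbd_synced.sh; the names AssertionSpec, describe and is_due are likewise made up for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class AssertionSpec:
    """Monitoring-system-neutral description of one test."""
    name: str
    interval_seconds: int
    severity: str  # hypothetical values: "warning" or "critical"


def describe(script_path, root):
    """Derive a spec from <root>/<interval>/<severity>/<script>."""
    interval, severity, filename = PurePath(script_path).relative_to(root).parts
    return AssertionSpec(
        name=PurePath(filename).stem,
        interval_seconds=int(interval),
        severity=severity,
    )


def is_due(spec, last_run, now):
    """True if the test has never run or its interval has elapsed.

    `last_run` and `now` are timestamps in seconds; `last_run` is None
    for a test that has not run yet.
    """
    return last_run is None or now - last_run >= spec.interval_seconds
```

With something like this, only the code that turns results into metrics would need to know about Prometheus; the scripts and their layout would stay the same if the monitoring system changed.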

Hopefully I'm not taking Prometheus in a direction that is terribly 
foreign, as this seems to be a path that I imagine others have already 
walked.

Thanks for reading,
Cameron Kerr
