Hello everyone, I'm embarking on some monitoring enhancements and I'm
wanting to make it really easy for my colleagues (who don't know
Prometheus) to write simple tests and have test failures show up as a
Prometheus alert.
The fundamental business need is to make it really easy for any admin to add
high-signal-low-noise alerts, particularly as a result of incident response.
My initial idea looks something like this:
I have a directory, similar in spirit to /etc/cron.d/ or /etc/cron.hourly/,
containing scripts that each test a particular aspect of the
system/application. E.g. you might have a simple bash or Python script that
runs some test and reports the result through its exit status (e.g. a return
code of 0 might indicate 'assertion-passed', and non-0 might indicate
'assertion-failed'). The name of the test might be taken from the filename of
the script.
A test-runner would run all of these scripts and generate appropriate
metrics for consumption by the textfile collector.
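
To make the runner idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `run_assertions` is hypothetical, and for portability the sketch only picks up `*.py` scripts and runs them with the current interpreter (a real runner would also need to handle bash, PowerShell, etc.).

```python
import subprocess
import sys
from pathlib import Path


def run_assertions(test_dir):
    """Run every *.py script in test_dir and return {test_name: 1 or 0}.

    Hypothetical convention: the file name without its extension is the
    test name, and exit code 0 means the assertion passed.
    """
    results = {}
    for script in sorted(Path(test_dir).glob("*.py")):
        # Run with the current interpreter so the sketch behaves the same
        # on Linux and Windows (no shebang or executable bit required).
        proc = subprocess.run(
            [sys.executable, str(script)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # 0 -> assertion-passed (1); anything else -> assertion-failed (0).
        results[script.stem] = 1 if proc.returncode == 0 else 0
    return results
```

The returned dictionary is deliberately independent of Prometheus, so the same results could be fed to a different backend later.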
Playing with what this might look like in terms of metrics, and thinking
about instances where this would have been useful in the past, it might look
like:
# HELP assertion Assertion is passing (1) or failing (0)
# TYPE assertion gauge
assertion{test_name="apache_httpd_configtest_okay"} 1
assertion{test_name="drbd_synced"} 1
assertion{test_name="transfer_queue_not_stuck"} 0
assertion{test_name="can_reach_ldap_server"} 1
assertion{test_name="connected_to_accounting_service"} 1
assertion{test_name="federation_metadata_current"} 1
assertion{test_name="login_test"} 1
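
As a rough sketch of how such output could be produced, the following Python turns a `{test_name: 0 or 1}` mapping into the text exposition format and writes it out. The function names and the choice of output file are assumptions for illustration. The temporary-file-then-rename step is there so the collector never reads a half-written file; node_exporter's textfile collector only picks up files ending in `.prom`, so the `.tmp` file should be ignored.

```python
import os
import tempfile


def render_metrics(results):
    """Render {test_name: 0 or 1} in the Prometheus text exposition format."""
    lines = [
        "# HELP assertion Assertion is passing (1) or failing (0)",
        "# TYPE assertion gauge",
    ]
    for name, value in sorted(results.items()):
        # Label values must escape backslash, double quote and newline.
        escaped = (
            name.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'assertion{{test_name="{escaped}"}} {value}')
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write text to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; relax it so an exporter running
    # as another user can read the result (largely a no-op on Windows).
    os.chmod(tmp_path, 0o644)
    # os.replace overwrites the destination on both Linux and Windows.
    os.replace(tmp_path, path)
```

Usage would be something like `write_atomically("assertions.prom", render_metrics(results))`, with the file placed in whatever directory the textfile collector is configured to read.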
Before I go further down this path, I'm wanting to know if others have done
something similar and to survey what works and what doesn't, so I don't
take my group down a wrong path. After all, it doesn't take much playing to
determine follow-on requirements such as:
* I want this to be as easy as dropping in some logic (tests) in a
manner befitting how the server/service being deployed is deployed (eg.
manually, via Ansible, etc. etc.). This must not require the Prometheus
team to be the ones having to create the tests, only raise alerts in
response to assertion failures.
* I need to have some tests run much more frequently than others (like
unit-tests are to integration-tests).
* Some assertions will be warnings, others will be more critical.
* Process supervision must be present to handle process timeout etc.
* If we migrate from Prometheus to something else (or a later major version
of Prometheus), I want this collateral to still be useful, so I need a
decent abstraction interface.
* Oh, and this must work on Linux systems as well as Windows.
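
Two of these requirements (timeouts and severity/frequency) could be sketched as below. Everything here is an assumption for illustration: the helper names are hypothetical, and the directory layout `<root>/<interval>/<severity>/<test_name>.<ext>` is just one possible convention, not an established one. A timed-out test is treated as a failed assertion.

```python
import subprocess
from pathlib import Path


def run_one(command, timeout_seconds):
    """Return 1 if command exits 0 within the timeout, otherwise 0."""
    try:
        proc = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; note that any
        # grandchildren the script started are not cleaned up here.
        return 0
    except OSError:
        # The script could not be started at all (missing, not executable).
        return 0
    return 1 if proc.returncode == 0 else 0


def describe_test(script_path):
    """Derive metadata from the hypothetical layout
    <root>/<interval>/<severity>/<test_name>.<ext>."""
    path = Path(script_path)
    return {
        "test_name": path.stem,
        "severity": path.parent.name,
        "interval": path.parent.parent.name,
    }
```

With a layout like this, a scheduler (cron, a systemd timer, or Task Scheduler on Windows) could invoke the runner once per interval directory, and the severity could be attached as a label for the alerting rules to route on.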
Hopefully I'm not taking Prometheus in a direction that is terribly
foreign, as this seems to be a path that I imagine others have already
walked.
Thanks for reading,
Cameron Kerr