While this is a possible pattern, it doesn't typically follow Prometheus
best practices.

The idea behind Prometheus is that you expose data directly from the
thing being monitored, and make the logical decision for "ok/not-ok" on the
Prometheus server. This has a lot of advantages over host-local
checks. For example, you can take into account data from your entire fleet,
not just what the node can see in isolation.
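
As a rough sketch, a rule like this (the job label is just a placeholder
for whatever your setup uses) makes that decision on the Prometheus server
and can look at the whole fleet at once:

groups:
  - name: fleet
    rules:
      # Fire only when more than 10% of the node exporters are down,
      # something no host-local check could decide on its own.
      - alert: ManyNodesDown
        expr: count(up{job="node"} == 0) / count(up{job="node"}) > 0.1
        for: 5m
        labels:
          severity: critical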

For example, since you have a DRBD check, it would be better to get the DRBD
metrics directly, for instance using the drbd collector in the
node_exporter.
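
If your node_exporter version ships the drbd collector (it reads
/proc/drbd), a quick sanity check that it is exposing something would be
along these lines (flag and metric names may differ between versions):

# enable the collector if it is not already on by default
node_exporter --collector.drbd
# then check what a scrape returns
curl -s http://localhost:9100/metrics | grep '^node_drbd'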

Other checks you have there, like the login test, are good examples of
blackbox tests. But beware that blackbox tests have limited usefulness:
they are blind during the time between probes, and they don't really tell
you anything about the actual user requests going on. For something like
logins, you would want to gather all login attempts, count them, count
failures, etc. For systems that you can't instrument directly, like
standard OS logins, you could do this with a log-tailing metrics generator,
for example https://github.com/google/mtail. There are several variations
of this kind of tool.
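
As a rough example, an mtail program that counts failed SSH logins
(assuming sshd-style "Failed password" lines in your auth log) is only a
few lines:

# exported by mtail as a counter on its own /metrics endpoint
counter ssh_failed_logins

/Failed password for/ {
  ssh_failed_logins++
}

You would point mtail at the program directory and the log file with
something like --progs and --logs, and scrape mtail alongside the
node_exporter.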

One thing to think about: if you generate "normal" Prometheus metrics, many
other monitoring platforms support this format now, so building
Prometheus-compatible exporters is not wasted effort. Even the proprietary
vendors support processing Prometheus data.

On Mon, May 11, 2020 at 5:32 AM Cameron Kerr <[email protected]>
wrote:

> Hello everyone, I'm embarking on some monitoring enhancements and I'm
> wanting to make it really easy for my colleagues (who don't know
> Prometheus) to write simple tests and have test failures show up as a
> Prometheus alert.
>
> The fundamental business need is to make it really easy for any admin to add
> high-signal-low-noise alerts, particularly as a result of incident response.
>
> My initial idea looks something like this:
>
> I have a directory, similar in spirit to /etc/cron.d/ or
> /etc/cron.hourly/, containing scripts that each test a particular
> aspect of the system/application. E.g. you might have a simple bash or
> Python script that runs some test and returns some response (e.g. a return
> code of 0 might indicate 'assertion-passed', and non-0 might indicate
> 'assertion-failed'). The name of the test might be taken from the filename
> of the script.
>
> A test-runner would run all of these scripts and generate appropriate
> metrics for consumption via the textfile collector.
>
> Playing with what this might look like in terms of metrics, and thinking
> about instances where this would have been useful in the past, I came up
> with something like:
>
> # HELP assertion Assertion is passing (1) or failing (0)
> # TYPE assertion gauge
> assertion{test_name="apache_httpd_configtest_okay"} 1
> assertion{test_name="drbd_synced"} 1
> assertion{test_name="transfer_queue_not_stuck"} 0
> assertion{test_name="can_reach_ldap_server"} 1
> assertion{test_name="connected_to_accounting_service"} 1
> assertion{test_name="federation_metadata_current"} 1
> assertion{test_name="login_test"} 1
>
> Before I go further down this path, I'm wanting to know if others have
> done something similar and to survey what works and what doesn't, so I
> don't take my group down a wrong path. After all, it doesn't take much
> playing to determine follow-on requirements such as:
>
> * I want this to be as easy as dropping in some logic (tests) in a
> manner befitting how the server/service is deployed (e.g.
> manually, via Ansible, etc.). This must not require the Prometheus
> team to be the ones creating the tests, only raising alerts in
> response to assertion failures.
> * I need to have some tests run much more frequently than others (like
> unit-tests are to integration-tests)
> * Some assertions will be warnings, others will be more critical
> * Process supervision must be present to handle process timeout etc.
> * If we migrate from Prometheus to something else (or a later major
> version of Prometheus), I want this collateral to still be useful, so I
> need a decent abstraction interface.
> * Oh, and this must work on Linux systems as well as Windows.
>
> Hopefully I'm not taking Prometheus in a direction that is terribly
> foreign, as this seems to be a path that I imagine others have already
> walked.
>
> Thanks for reading,
> Cameron Kerr
>
