> I've recently started monitoring a large fleet of hardware devices
> using a combination of blackbox, snmp, node, and json exporters. I
> started out using the *up* metric, but I noticed when using blackbox
> ping, *up* is *always* 1 even when the device is offline. So I plan to
> switch to *probe_success* instead. But I'm thinking about the
> implications of this when mixed with other exporters. For example
> json-exporter does not offer a *probe_success* metric; instead it
> returns *up*=0 when the target times out.
Roughly speaking, up{} tells you if the exporter is running (technically, whether it responds okay to being scraped), and then the exporter may have its own metrics to say whether or not it's been successful at generating metrics or doing whatever it normally does. Some exporters always succeed if they're up; some have more granular success or failure (for example, individual collectors in the node exporter); and some have completely decoupled up and success statuses, as is the case with the Blackbox exporter (where the exporter is often not even on the machines you're checking with a particular probe).

Complicating the picture, if an exporter is down (its up is 0), then it's not generating any metrics, and any success metrics it would normally generate are absent instead of reporting failure. (The up metric is generated internally by Prometheus itself based on whether the scrape succeeded, so an exporter such as the Blackbox exporter can only influence it by not responding at all, which would mean that Blackbox couldn't return any metrics that might explain why the probe failed. Even for ICMP probes there can be multiple reasons for a failure.)

The Blackbox exporter is a bit tricky to understand in relation to up{}, because unlike many exporters you create multiple scrape targets against (or through) the same exporter. This generally means you want to ignore the up{} metric for any particular Blackbox probe and instead scrape Blackbox's own metrics endpoint and pay attention to its up{} (for alerts, for example). Other exporters are much more one to one: you scrape each exporter once through one target, so there's only one up{} metric, and it goes to 0 if that particular exporter instance isn't responding. (However, this is not universally true; there are other multi-target indirect exporters like Blackbox.
I believe that the SNMP exporter is another one where you often have one exporter separately scraping a lot of targets, and each target will have its own up{} metric that you probably want to ignore.)

> My goal is to build a Grafana dashboard and alerts that monitors a
> combination of blackbox and other exporters. For context, when certain
> devices crash, they remain pingable, but they return their failed
> state via REST API. So I'm setting the json-exporter to an HTTP target
> endpoint. I'm struggling to come up with a unified way of monitoring
> all these different devices types.

Unfortunately there is no unified way, as far as I know. If you want one in the Grafana frontend, you might need to make up some sort of synthetic 'is-up' metric through recording rules that know how to combine all of the various status results into one piece of information. (I don't think Grafana has a way of defining 'functions' like this that can be used across multiple panels and reused between dashboards, but I'm out of touch with the latest features in current versions of Grafana.)

In our environment, it's useful for us to have a granular view of what has failed. That a device has stopped pinging is a different issue than its node_exporter not being up, so our dashboards (and alerts) reflect that. However, we have a small enough number of devices that we can deal with things this verbosely.

	- cks

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3783582.1701144933%40apps0.cs.toronto.edu.
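P.S. To make the synthetic 'is-up' idea concrete, one way is a recording rule per job that normalizes each exporter's notion of health onto a single metric name. This is only a sketch; the job names and the metric name device:is_up are invented for illustration, not taken from your setup.

```yaml
# Hypothetical recording rules (job names and "device:is_up" are made up).
groups:
  - name: synthetic-is-up
    rules:
      # Blackbox targets: probe_success is the real health signal. If the
      # Blackbox exporter itself is unreachable, probe_success is absent,
      # so fall back to that scrape's up{} (which will be 0) via 'or'.
      - record: device:is_up
        expr: probe_success{job="blackbox"} or up{job="blackbox"}
      # json_exporter targets: the scrape fails when the backend times
      # out, so up{} already goes to 0 in that case.
      - record: device:is_up
        expr: up{job="json"}
      # Ordinary one-to-one exporters such as node_exporter.
      - record: device:is_up
        expr: up{job="node"}
```

Grafana panels and alerts could then use device:is_up uniformly, with the job label still telling you which kind of check failed.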