Re: [PATCH 1/9] MAJOR: contrib/prometheus-exporter: move health check status to labels

2021-02-01 Thread William Dauchy
On Mon, Feb 1, 2021 at 12:05 PM Pierre Cheynier  wrote:
> In addition to this update, I would add some recommendations in the README
> about the user setup ("how do I prevent my Prometheus instance from
> exploding when scraping this endpoint?").
> For server-template users:
> - 
>   params:
>     no-maint:
>     - empty
>
> Generally speaking, to prevent all server metrics from being saved,
> except this one:
> - 
>   metric_relabel_configs:
>   - source_labels: ['__name__']
>     regex: 'haproxy_(process_|frontend_|backend_|server_check_status).*'
>     action: keep

agreed, taken into account in v2

> "Status of last health check, per state value" ?

updated in v2.
-- 
William



RE: [PATCH 1/9] MAJOR: contrib/prometheus-exporter: move health check status to labels

2021-02-01 Thread Pierre Cheynier

From: William Dauchy 
Sent: Saturday, January 30, 2021, 16:21

> this is a follow-up of commit c6464591a365bfcf509b322bdaa4d608c9395d75
> ("MAJOR: contrib/prometheus-exporter: move ftd/bkd/srv states to
> labels"). The main goal is to be better aligned with prometheus use
> cases in terms of queries. More specifically for health checks, Pierre C.
> mentioned the possible quirks he had to put in place in order to make
> use of those metrics through prometheus:
> 
>    by(proxy, check_status) (count_values by(proxy, instance) ("check_status", haproxy_server_check_status))

Indeed, we wanted to aggregate at backend level to produce a view which
represents the "number of servers per health status". Still, this was not
ideal, since I was getting the status code rather than the status name.
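
For illustration, here is roughly what that aggregation looks like before and after the change. Note that the outer aggregation operator in the quoted query above was lost by the mail archive, so `count` is an assumption here; likewise the `state` label values are the check status names the patch exposes:

```
# Before: the status arrives as a numeric code, so it has to be
# grouped by hand and the codes mapped back to names afterwards
count by(proxy, check_status) (
  count_values by(proxy, instance) ("check_status", haproxy_server_check_status)
)

# After: the status name is a label, so a filter plus a count suffices
count by(proxy, state) (haproxy_server_check_status == 1)
```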

> I am perfectly aware this introduces a lot more metrics, but I don't see
> how we can improve the usability without it. The main issue remains the
> cardinality of the states, which is > 20. Prometheus recommends staying
> below a cardinality of 10 for a given metric, but I consider our case
> very specific because it is highly linked to the level of precision
> haproxy exposes.
> 
> Even before this patch I saw several large production setups (a few
> hundred MB of output) which make use of the scope parameter to simply
> ignore the server metrics, so that scraping is faster and the memory
> consumed on the client side stays reasonable. So I believe we should
> eventually continue in that direction and offer more granular filtering
> of the output. That being said, it is already possible to filter out the
> data on the prometheus client side.

True, I think this is the right approach to represent such data in Prometheus.
It will create some challenges for people like us, who use server-templates
and create thousands of backends, but we'll have to deal with it.
In addition to this update, I would add some recommendations in the README
about the user setup ("how do I prevent my Prometheus instance from
exploding when scraping this endpoint?").
For server-template users:
- 
  params:
    no-maint:
    - empty

Generally speaking, to prevent all server metrics from being saved, except this one:
- 
  metric_relabel_configs:
  - source_labels: ['__name__']
    regex: 'haproxy_(process_|frontend_|backend_|server_check_status).*'
    action: keep
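
To make the placement of these snippets concrete, here is a minimal sketch of a complete scrape job; the job name, target address, port, and metrics path are hypothetical examples, not taken from this thread:

```yaml
scrape_configs:
- job_name: 'haproxy'                       # hypothetical job name
  metrics_path: '/metrics'                  # as configured on the haproxy side
  params:
    no-maint:                               # skip server-template slots in MAINT
    - empty
  static_configs:
  - targets: ['haproxy.example.com:8404']   # hypothetical target
  metric_relabel_configs:
  - source_labels: ['__name__']
    regex: 'haproxy_(process_|frontend_|backend_|server_check_status).*'
    action: keep                            # drop every other server metric
```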

A long-term alternative (at least for my use-case) would be to provide
this data at backend level, as initially suggested here:
https://www.mail-archive.com/haproxy@formilux.org/msg35369.html

> Signed-off-by: William Dauchy 
> ---

> @@ -319,7 +320,7 @@ const struct ist promex_st_metric_desc[ST_F_TOTAL_FIELDS] = {
>  	[ST_F_RATE]   = IST("Current number of sessions per second over last elapsed second."),
>  	[ST_F_RATE_LIM]   = IST("Configured limit on new sessions per second."),
>  	[ST_F_RATE_MAX]   = IST("Maximum observed number of sessions per second."),
> -	[ST_F_CHECK_STATUS]   = IST("Status of last health check (HCHK_STATUS_* values)."),
> +	[ST_F_CHECK_STATUS]   = IST("Status of last health check (0/1 depending on current `state` label value)."),

"Status of last health check, per state value" ?

--
Pierre



[PATCH 1/9] MAJOR: contrib/prometheus-exporter: move health check status to labels

2021-01-30 Thread William Dauchy
this patch is a breaking change between v2.3 and v2.4: we move from
using a gauge value for health check states to label values. The diff is
quite small thanks to the preparation work from Christopher allowing
more flexibility in labels; see commit
5a2f938732126f43bbf4cea5482c01552b0d0314 ("MEDIUM:
contrib/prometheus-exporter: Use dynamic labels instead of static ones")

this is a follow-up of commit c6464591a365bfcf509b322bdaa4d608c9395d75
("MAJOR: contrib/prometheus-exporter: move ftd/bkd/srv states to
labels"). The main goal is to be better aligned with prometheus use
cases in terms of queries. More specifically for health checks, Pierre C.
mentioned the possible quirks he had to put in place in order to make
use of those metrics through prometheus:

   by(proxy, check_status) (count_values by(proxy, instance) ("check_status", haproxy_server_check_status))

I am perfectly aware this introduces a lot more metrics, but I don't see
how we can improve the usability without it. The main issue remains the
cardinality of the states, which is > 20. Prometheus recommends staying
below a cardinality of 10 for a given metric, but I consider our case
very specific because it is highly linked to the level of precision
haproxy exposes.

Even before this patch I saw several large production setups (a few
hundred MB of output) which make use of the scope parameter to simply
ignore the server metrics, so that scraping is faster and the memory
consumed on the client side stays reasonable. So I believe we should
eventually continue in that direction and offer more granular filtering
of the output. That being said, it is already possible to filter out the
data on the prometheus client side.

this is related to github issue #1029

Signed-off-by: William Dauchy 
---
 .../prometheus-exporter/service-prometheus.c| 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/contrib/prometheus-exporter/service-prometheus.c b/contrib/prometheus-exporter/service-prometheus.c
index dbf4c7f39..a3141d39d 100644
--- a/contrib/prometheus-exporter/service-prometheus.c
+++ b/contrib/prometheus-exporter/service-prometheus.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -319,7 +320,7 @@ const struct ist promex_st_metric_desc[ST_F_TOTAL_FIELDS] = {
 	[ST_F_RATE]   = IST("Current number of sessions per second over last elapsed second."),
 	[ST_F_RATE_LIM]   = IST("Configured limit on new sessions per second."),
 	[ST_F_RATE_MAX]   = IST("Maximum observed number of sessions per second."),
-	[ST_F_CHECK_STATUS]   = IST("Status of last health check (HCHK_STATUS_* values)."),
+	[ST_F_CHECK_STATUS]   = IST("Status of last health check (0/1 depending on current `state` label value)."),
 	[ST_F_CHECK_CODE] = IST("layer5-7 code, if available of the last health check."),
 	[ST_F_CHECK_DURATION] = IST("Total duration of the latest server health check, in seconds."),
 	[ST_F_HRSP_1XX]   = IST("Total number of HTTP responses."),
@@ -886,6 +887,7 @@ static int promex_dump_srv_metrics(struct appctx *appctx, struct htx *htx)
 	int ret = 1;
 	double secs;
 	enum promex_srv_state state;
+	const char *check_state;
 	int i;
 
 	for (;appctx->st2 < ST_F_TOTAL_FIELDS; appctx->st2++) {
@@ -963,8 +965,17 @@ static int promex_dump_srv_metrics(struct appctx *appctx, struct htx *htx)
 			case ST_F_CHECK_STATUS:
 				if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) != CHK_ST_ENABLED)
 					goto next_sv;
-				val = mkf_u32(FN_OUTPUT, sv->check.status);
-				break;
+
+				for (i = 0; i < HCHK_STATUS_SIZE; i++) {
+					val = mkf_u32(FO_STATUS, sv->check.status == i);
+					check_state = get_check_status_info(i);
+					labels[2].name  = ist("state");
+					labels[2].value = ist2(check_state, strlen(check_state));
+					if (!promex_dump_metric(appctx, htx, prefix, &promex_st_metrics[appctx->st2],
+								&val, labels, &out, max))
+						goto full;
+				}
+				goto next_sv;
 			case ST_F_CHECK_CODE:
 				if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) != CHK_ST_ENABLED)
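
For reference, with this patch a scrape exposes one series per possible check status for each server, along the lines of the sketch below (the proxy and server names are made up; only the series matching the current status carries the value 1):

```
haproxy_server_check_status{proxy="web",server="srv1",state="L4OK"} 0
haproxy_server_check_status{proxy="web",server="srv1",state="L6OK"} 0
haproxy_server_check_status{proxy="web",server="srv1",state="L7OK"} 1
```

This is exactly what drives the cardinality concern discussed above: each checked server now emits HCHK_STATUS_SIZE series for this metric instead of one.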