[PATCH] BUG/MINOR: promex: fix backend_agg_check_status

2023-09-12 Thread Cedric Paillet
When a server is in maintenance, the check.status is no longer updated. 
Therefore, we shouldn't consider check.status if the checks are not active. 
This check was properly implemented in the haproxy_server_check_status metric, 
but wasn't carried over to backend_agg_check_status, which introduced 
inconsistencies between the two metrics.
---
 addons/promex/service-prometheus.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..6885d202e 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,8 +863,10 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                     goto next_px;
                 sv = px->srv;
                 while (sv) {
-                    srv_check_status = sv->check.status;
-                    srv_check_count[srv_check_status] += 1;
+                    if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) == CHK_ST_ENABLED) {
+                        srv_check_status = sv->check.status;
+                        srv_check_count[srv_check_status] += 1;
+                    }
                     sv = sv->next;
                 }
                 for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
-- 
2.25.1



RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-12 Thread Cedric Paillet
Hi,

Sorry, I now understand my problem better.

When a server is in maintenance, the health checks are not performed and 
sv->check.state == CHK_ST_PAUSED.
This is checked in the haproxy_server_check_status metric but not in 
backend_agg_check_status.
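For reference, here is a tiny self-contained sketch of the test I have in mind (the
CHK_ST_* values below are illustrative stand-ins, not HAProxy's real flag values): the
idea is to count a server's check.status only when its checks are both enabled and not
paused, which is what the per-server haproxy_server_check_status dump already verifies.

/* Illustrative sketch only; the flag values are stand-ins, not HAProxy's. */
#include <stdio.h>

#define CHK_ST_ENABLED 0x01 /* stand-in: checks are configured/enabled */
#define CHK_ST_PAUSED  0x02 /* stand-in: checks are paused (maintenance) */

static int check_is_active(unsigned int check_state)
{
        /* enabled and not paused */
        return (check_state & (CHK_ST_ENABLED | CHK_ST_PAUSED)) == CHK_ST_ENABLED;
}

int main(void)
{
        printf("enabled:          %d\n", check_is_active(CHK_ST_ENABLED));                 /* 1 */
        printf("enabled + paused: %d\n", check_is_active(CHK_ST_ENABLED | CHK_ST_PAUSED)); /* 0 */
        printf("disabled:         %d\n", check_is_active(0));                              /* 0 */
        return 0;
}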
I will check my new patch and push it afterwards.

Thank you very much.


-----Original Message-----
From: Christopher Faulet
Sent: Monday, September 11, 2023 20:15
To: Cedric Paillet; haproxy@formilux.org
Subject: Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

On 07/09/2023 at 16:50, Cedric Paillet wrote:
> 
>> And I guess we should also check that the healthchecks are enabled for the 
>> server. It is not really an issue because the call to get_check_status_result() 
>> will exclude neutral and unknown statuses. But there is no reason to count 
>> these servers.
> 
> What we observed is that the health check of servers that have been put into 
> maintenance is no longer updated, and the status returned is the last known 
> one. I need to double-check, but I believe we even saw L7OK when putting a 
> "UP" server into maintenance. (What I'm sure of is that the majority of 
> servers in maintenance were in L7STS and not in UNK).
> 
> If we add the PROMEX_FL_NO_MAINT_SRV check (which makes sense), we will continue to 
> display an incorrect backend_agg_server_status (and also 
> haproxy_server_check_status) for servers in maintenance for users who don't 
> set no-maint=empty, and we probably then need another patch so that the 
> status of these servers is HCHK_STATUS_UNKNOWN?
> 

Health-checks for servers in maintenance are paused. So indeed, the last known 
status does not change anymore in this state. My purpose here was to also 
filter servers to only count those with health-checks enabled and running. When 
the server's metrics are dumped, the check status is already skipped for 
servers in maintenance. Thus it seems logical to not count them for the 
aggregated metric.

In fact, this way, all servers in maintenance are skipped without checking the 
server's admin state. But it is probably cleaner to keep both checks for 
consistency, unless I missed something. This part is not really clear to me 
anymore...

--
Christopher Faulet



[PATCH] MINOR: promex: Introduce 'FULL' status for srv_state metric

2023-09-11 Thread Cedric Paillet
This adds a new 'FULL' status to the Prometheus metric 'srv_state'. It helps 
identify servers that have reached their maxconn limit and can no longer accept 
new connections.
Also rename server_has_room to !server_is_full to match what's used at a few 
places in the doc in association with servers or backends being "full".
---
 addons/promex/service-prometheus.c | 7 +++++++
 include/haproxy/queue.h            | 6 +++---
 src/backend.c                      | 2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index 7b6081fab..40ec00085 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -380,6 +380,7 @@ enum promex_srv_state {
     PROMEX_SRV_STATE_UP,
     PROMEX_SRV_STATE_MAINT,
     PROMEX_SRV_STATE_DRAIN,
+    PROMEX_SRV_STATE_FULL,
     PROMEX_SRV_STATE_NOLB,
 
     PROMEX_SRV_STATE_COUNT /* must be last */
@@ -390,10 +391,13 @@ const struct ist promex_srv_st[PROMEX_SRV_STATE_COUNT] = {
     [PROMEX_SRV_STATE_UP]    = IST("UP"),
     [PROMEX_SRV_STATE_MAINT] = IST("MAINT"),
     [PROMEX_SRV_STATE_DRAIN] = IST("DRAIN"),
+    [PROMEX_SRV_STATE_FULL]  = IST("FULL"),
     [PROMEX_SRV_STATE_NOLB]  = IST("NOLB"),
 };
 
 /* Return the server status. */
+
+
 enum promex_srv_state promex_srv_status(struct server *sv)
 {
     int state = PROMEX_SRV_STATE_DOWN;
@@ -402,6 +406,9 @@ enum promex_srv_state promex_srv_status(struct server *sv)
         state = PROMEX_SRV_STATE_UP;
         if (sv->cur_admin & SRV_ADMF_DRAIN)
             state = PROMEX_SRV_STATE_DRAIN;
+        if (server_is_full(sv))
+            state = PROMEX_SRV_STATE_FULL;
+
     }
     else if (sv->cur_state == SRV_ST_STOPPING)
         state = PROMEX_SRV_STATE_NOLB;
diff --git a/include/haproxy/queue.h b/include/haproxy/queue.h
index e77370cdd..2817b6c8b 100644
--- a/include/haproxy/queue.h
+++ b/include/haproxy/queue.h
@@ -77,9 +77,9 @@ static inline void pendconn_free(struct stream *s)
     }
 }
 
-/* Returns 0 if all slots are full on a server, or 1 if there are slots available. */
-static inline int server_has_room(const struct server *s) {
-    return !s->maxconn || s->cur_sess < srv_dynamic_maxconn(s);
+/* Returns 1 if all slots are full on a server, or 0 if there are slots available. */
+static inline int server_is_full(const struct server *s) {
+    return s->maxconn && s->cur_sess >= srv_dynamic_maxconn(s);
 }
 
 /* returns 0 if nothing has to be done for server  regarding queued connections,
diff --git a/src/backend.c b/src/backend.c
index 2f9da87ab..9b9e63649 100644
--- a/src/backend.c
+++ b/src/backend.c
@@ -658,7 +658,7 @@ int assign_server(struct stream *s)
         if (tmpsrv && tmpsrv->proxy == s->be &&
             ((s->sess->flags & SESS_FL_PREFER_LAST) ||
              (!s->be->max_ka_queue ||
-              server_has_room(tmpsrv) || (
+              !server_is_full(tmpsrv) || (
              tmpsrv->queue.length + 1 < s->be->max_ka_queue))) &&
             srv_currently_usable(tmpsrv)) {
             list_for_each_entry(conn, _list->conn_list, session_list) {
-- 
2.25.1



[PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-11 Thread Cedric Paillet
The state of servers that were put in maintenance via the runtime API is
reported within the "backend_agg_check_status" metric, which
leads to inconsistent sums when compared to the "haproxy_server_check_status"
metric.

Now excluding them from this computation.
---
 addons/promex/service-prometheus.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..7b6081fab 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                     goto next_px;
                 sv = px->srv;
                 while (sv) {
+                    if ((ctx->flags & PROMEX_FL_NO_MAINT_SRV) && (sv->cur_admin & SRV_ADMF_MAINT))
+                        goto next_sv;
                     srv_check_status = sv->check.status;
                     srv_check_count[srv_check_status] += 1;
-                    sv = sv->next;
+                  next_sv:
+                    sv = sv->next;
                 }
                 for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
                     if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
-- 
2.25.1



RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-07 Thread Cedric Paillet
> I guess you mean "backend_agg_check_status". Because 
> "backend_agg_server_check_status" is a deprecated metric. It was replaced by 
> "backend_agg_server_status".
Yes, sorry (and I should know, it's me who deprecated this value!)
In this patch both will be updated, as it's the same code!
I will update the commit message.
(We can deprecate it in 2.9? Do you want a patch?)

> Here we should check for PROMEX_FL_NO_MAINT_SRV promex flag, no ? I mean:
Yes, sure.

>And I guess we should also check that the healthchecks are enabled for the server. 
>It is not really an issue because the call to get_check_status_result() will 
>exclude neutral and unknown statuses. But there is no reason to count these 
>servers.

What we observed is that the health check of servers that have been put into 
maintenance is no longer updated, and the status returned is the last known 
one. I need to double-check, but I believe we even saw L7OK when putting a "UP" 
server into maintenance. (What I'm sure of is that the majority of servers in 
maintenance were in L7STS and not in UNK).

If we add the PROMEX_FL_NO_MAINT_SRV check (which makes sense), we will continue to 
display an incorrect backend_agg_server_status (and also 
haproxy_server_check_status) for servers in maintenance for users who don't set 
no-maint=empty, and we probably then need another patch so that the status of 
these servers is HCHK_STATUS_UNKNOWN?

Cédric



RE: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

2023-09-07 Thread Cedric Paillet
Hi,

Thanks for your feedback.

> Indeed, so actually you found a pretty valid use case to the edge case I was 
> seeing as rare enough not to be of much interest :-)

Just to provide some background, which might help in understanding our needs:

In our real-life scenarios, servers do not have uniform performance. Our 
haproxy "servers", in our context, are typically containers. These containers 
can reside on very diverse hardware, depending on the age of the server. We 
attempt to mitigate these discrepancies with server weights, but that's not 
always sufficient; not all servers respond at the same speed. The performance 
can also be compromised by "noisy" neighbors co-hosted on the same server. 
Sometimes we even have instances that fail because the underlying code is 
poorly written.
I don't know if it's common with others, but when a backend starts to become 
overloaded, it happens gradually, and during the initial days, the first 
symptoms are a few servers becoming FULL during peak hours.


> Yes that's it. For this specific use case, watching the queue is much simpler 
> and provides more than just a boolean

Your discussion about queues is very interesting, and I believe we will 
incorporate those into our metrics.

My idea is not to present a boolean, but a percentage of FULL servers by 
backend.

We believe it's more illustrative to see that around 7 a.m., all servers are 
functioning, but by 7:10 a.m., 10% of the servers begin to be "FULL" due to the 
increase in traffic. By 7:30 a.m., it's 50%, and by 8 a.m., it's 100%.

We think this curve is simpler to grasp than just seeing, at 8 a.m., a queue 
beginning to grow, while all these servers still seem OK 
(haproxy_backend_agg_check_status=L7OK, and 
haproxy_backend_agg_server_status="UP").

Currently, we compute this metric using this Prometheus calculation:
count by (proxy, instance) (haproxy_server_current_sessions > 180)
(even though we have a maxconn set at 200).

This calculation has several issues:
- maxconn cannot be retrieved; it has to be hardcoded in Prometheus.
- it forces us to use "/metrics?scope=server", which scrapes all server 
metrics, something we're trying to avoid because it consumes a lot of 
resources to aggregate on the Prometheus side.

With my patch, we can achieve the same with count by (proxy, instance) 
(haproxy_backend_agg_server_status{state="FULL"}), just by using 
/metrics?scope=backend.
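
For what it's worth, the aggregation behind that number is simple; here is a rough,
self-contained sketch of the idea (the struct below is a simplified stand-in for
HAProxy's struct server, and the "full" test mirrors server_is_full() from my patch,
ignoring srv_dynamic_maxconn()):

/* Rough sketch only, not HAProxy code. */
#include <stdio.h>

struct srv_stub {
        int maxconn;            /* configured maxconn (0 = unlimited) */
        int cur_sess;           /* current number of sessions */
        struct srv_stub *next;  /* next server of the backend */
};

static int srv_stub_is_full(const struct srv_stub *s)
{
        return s->maxconn && s->cur_sess >= s->maxconn;
}

/* Percentage of the backend's servers currently FULL. */
static int backend_full_pct(const struct srv_stub *sv)
{
        int total = 0, full = 0;

        for (; sv; sv = sv->next) {
                total++;
                full += srv_stub_is_full(sv);
        }
        return total ? full * 100 / total : 0;
}

int main(void)
{
        struct srv_stub s2 = { .maxconn = 200, .cur_sess = 200, .next = NULL };
        struct srv_stub s1 = { .maxconn = 200, .cur_sess = 50,  .next = &s2 };

        printf("%d%% of the servers are FULL\n", backend_full_pct(&s1)); /* 50% */
        return 0;
}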

The main question I'm asking myself is whether my calculation method will work, 
because a server that starts to saturate will "oscillate" between FULL and OK. 
We risk missing it, not necessarily capturing the value at the right time.

Cedric

-----Original Message-----
From: Willy Tarreau
Sent: Wednesday, September 6, 2023 10:36
To: Cedric Paillet
Cc: haproxy@formilux.org
Subject: Re: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

Hi Cedric,

On Tue, Sep 05, 2023 at 01:40:14PM +, Cedric Paillet wrote:
> We are using Prometheus to provide feedback to our users about the 
> status of backend servers. Currently, we have no means of informing 
> them if a server exceeds the maxconn limit, and consequently why it's 
> no longer receiving new requests.
> 
> Therefore, we would like to be able to display when a server surpasses 
> the maxconn limit and is in the "noroom" state. I have prepared a 
> patch specifically for Prometheus, but it might be better to include a 
> boolean directly in the server structure indicating whether the server 
> was considered to have no room the last time server_has_room was 
> called. However, this change seems to have a significant impact on other 
> parts of the code.

I think that it might be more suitable to use the term "FULL" that we already 
use for the listeners, and that also matches what's used at a few places in the 
doc in association with servers or backends being "full".

Also, a more accurate metric that is generally watched is the queue (both 
server and backend): instead of being a boolean, it directly indicates how many 
additional servers are needed to improve the processing time. Persistent 
requests are placed into the server's queue but all other requests go into the 
backend queue. So if you have a total capacity of 4 servers * 100 connections = 
400 outstanding requests, and you see a queue of 200, you know that you'd need 
two extra servers to process these without queuing, and you can even predict 
that the total processing time will decrease by 200*the average queue time, so 
this allows to even add just the required number of servers to keep response 
time below a certain limit.
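
Put differently, a rough sketch of that estimate (not HAProxy code, and assuming a
uniform per-server maxconn for simplicity):

#include <stdio.h>

/* Turn a backend queue length into "extra servers needed to process the
 * queued requests without queuing", given a per-server maxconn. */
static unsigned int extra_servers_needed(unsigned int queue_len, unsigned int per_srv_maxconn)
{
        if (!per_srv_maxconn)
                return 0;
        /* round up: a partially used extra server still has to exist */
        return (queue_len + per_srv_maxconn - 1) / per_srv_maxconn;
}

int main(void)
{
        /* 4 servers * 100 connections = 400 outstanding requests of capacity;
         * a queue of 200 means 2 extra servers are needed. */
        printf("%u extra servers\n", extra_servers_needed(200, 100));
        return 0;
}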

The only case where a server will be full without having any queue is when the 
total number of outstanding requests on a backend is exactly equal to the sum 
of the servers' maxconn values, so as you see, the extra metric covers very 
little compared to the queue itself. But there might be good use cases for 
this, I'm not denying it, I just wanted to make sure that you're going to 
monitor what's really relevant for your use case ;-)

Willy

RE: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

2023-09-06 Thread Cedric Paillet
Thanks for your review.

>I think that it might be more suitable to use the term "FULL"
Ok, no problem with that. (Perhaps we can also rename server_has_room to 
!server_is_full ?)


> Also, a more accurate metric that is generally watched is the queue (both 
> server and backend): 

My first use case is to detect/display one or more server(s) with problems, not 
to determine if the entire backend is undersized.
As I understand it, if just one server in the pool is very slow (and the load 
balancing method is round-robin), the number of sessions on this server will 
increase until it reaches maxconn. At this juncture, the server will no longer 
be selected, and requests will be routed to the other servers. Then, no queue 
(either backend or server) will start to fill up, correct? But the slow server 
will cease receiving requests until its session count drops below maxconn, 
right?

The second use case, as you've outlined, is to detect if a backend is 
undersized. My understanding is that if the backend is "nearly" undersized, the 
first symptom will be some servers reporting "FULL". Only when ALL servers are 
reporting "FULL" will the backend queue start to grow, correct?

Cédric


-----Original Message-----
From: Willy Tarreau
Sent: Wednesday, September 6, 2023 10:36
To: Cedric Paillet
Cc: haproxy@formilux.org
Subject: Re: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

Hi Cedric,

On Tue, Sep 05, 2023 at 01:40:14PM +, Cedric Paillet wrote:
> We are using Prometheus to provide feedback to our users about the 
> status of backend servers. Currently, we have no means of informing 
> them if a server exceeds the maxconn limit, and consequently why it's 
> no longer receiving new requests.
> 
> Therefore, we would like to be able to display when a server surpasses 
> the maxconn limit and is in the "noroom" state. I have prepared a 
> patch specifically for Prometheus, but it might be better to include a 
> boolean directly in the server structure indicating whether the server 
> was considered to have no room the last time server_has_room was 
> called. However, this change seems to have a significant impact on other 
> parts of the code.

I think that it might be more suitable to use the term "FULL" that we already 
use for the listeners, and that also matches what's used at a few places in the 
doc in association with servers or backends being "full".

Also, a more accurate metric that is generally watched is the queue (both 
server and backend): instead of being a boolean, it directly indicates how many 
additional servers are needed to improve the processing time. Persistent 
requests are placed into the server's queue but all other requests go into the 
backend queue. So if you have a total capacity of 4 servers * 100 connections = 
400 outstanding requests, and you see a queue of 200, you know that you'd need 
two extra servers to process these without queuing, and you can even predict 
that the total processing time will decrease by 200*the average queue time, so 
this allows to even add just the required number of servers to keep response 
time below a certain limit.

The only case where a server will be full without having any queue is when the 
total number of outstanding requests on a backend is exactly equal to the sum 
of the servers' maxconn values, so as you see, the extra metric covers very 
little compared to the queue itself. But there might be good use cases for 
this, I'm not denying it, I just wanted to make sure that you're going to 
monitor what's really relevant for your use case ;-)

Willy



[PATCH 1/1] MINOR: promex: Introduce 'NOROOM' status for srv_state metric

2023-09-05 Thread Cedric Paillet
This adds a new 'noroom' status to the Prometheus metric 'srv_state'. It helps 
identify servers that have exceeded their maxconn limit and cannot accept new 
connections.
---
 addons/promex/service-prometheus.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index 7b61683dd..4551009d5 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -380,6 +380,7 @@ enum promex_srv_state {
PROMEX_SRV_STATE_UP,
PROMEX_SRV_STATE_MAINT,
PROMEX_SRV_STATE_DRAIN,
+   PROMEX_SRV_STATE_NOROOM,
PROMEX_SRV_STATE_NOLB,
 
PROMEX_SRV_STATE_COUNT /* must be last */
@@ -390,10 +391,13 @@ const struct ist promex_srv_st[PROMEX_SRV_STATE_COUNT] = {
[PROMEX_SRV_STATE_UP]= IST("UP"),
[PROMEX_SRV_STATE_MAINT] = IST("MAINT"),
[PROMEX_SRV_STATE_DRAIN] = IST("DRAIN"),
+   [PROMEX_SRV_STATE_NOROOM] = IST("NOROOM"),
[PROMEX_SRV_STATE_NOLB]  = IST("NOLB"),
 };
 
 /* Return the server status. */
+
+
 enum promex_srv_state promex_srv_status(struct server *sv)
 {
int state = PROMEX_SRV_STATE_DOWN;
@@ -402,6 +406,9 @@ enum promex_srv_state promex_srv_status(struct server *sv)
state = PROMEX_SRV_STATE_UP;
if (sv->cur_admin & SRV_ADMF_DRAIN)
state = PROMEX_SRV_STATE_DRAIN;
+   if (!server_has_room(sv))
+   state = PROMEX_SRV_STATE_NOROOM;
+
}
else if (sv->cur_state == SRV_ST_STOPPING)
state = PROMEX_SRV_STATE_NOLB;
-- 
2.25.1



[PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

2023-09-05 Thread Cedric Paillet
We are using Prometheus to provide feedback to our users about the status of 
backend servers. Currently, we have no means of informing them if a server 
exceeds the maxconn limit, and consequently why it's no longer receiving new 
requests.

Therefore, we would like to be able to display when a server surpasses the 
maxconn limit and is in the "noroom" state. I have prepared a patch 
specifically for Prometheus, but it might be better to include a boolean 
directly in the server structure indicating whether the server was considered 
to have no room the last time server_has_room was called. However, this change 
seems to have a significant impact on other parts of the code.


Cedric Paillet (1):
  MINOR: promex: Introduce 'NOROOM' status for srv_state metric

 addons/promex/service-prometheus.c | 7 +++++++
 1 file changed, 7 insertions(+)

-- 
2.25.1



[PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-01 Thread Cedric Paillet
The state of servers that were put in maintenance via the runtime API is
reported within the "backend_agg_server_check_status" metric, which
leads to inconsistent sums when compared to the "haproxy_server_check_status"
metric.

Now excluding them from this computation.
---
 addons/promex/service-prometheus.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..7b61683dd 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                     goto next_px;
                 sv = px->srv;
                 while (sv) {
+                    if (sv->cur_admin & SRV_ADMF_MAINT)
+                        goto next_sv;
                     srv_check_status = sv->check.status;
                     srv_check_count[srv_check_status] += 1;
-                    sv = sv->next;
+                  next_sv:
+                    sv = sv->next;
                 }
                 for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
                     if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
-- 
2.25.1



[PATCH 1/2] [PATCH] BUG/MINOR: promex: create haproxy_backend_agg_server_status

2022-12-08 Thread Cedric Paillet
haproxy_backend_agg_server_check_status currently aggregates
haproxy_server_status instead of haproxy_server_check_status.
We deprecate this and create a new one,
haproxy_backend_agg_server_status to clarify what it really
does.
---
 addons/promex/service-prometheus.c | 4 +++-
 include/haproxy/stats-t.h          | 1 +
 src/stats.c                        | 6 ++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index b54695c17..9330a860c 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -302,6 +302,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
     [ST_F_NEED_CONN_EST]        = { .n = IST("need_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
     [ST_F_UWEIGHT]              = { .n = IST("uweight"),                  .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
     [ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"),  .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+    [ST_F_AGG_SRV_STATUS ]      = { .n = IST("agg_server_status"),        .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -833,7 +834,8 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                 return -1;
 
             switch (ctx->field_num) {
-            case ST_F_AGG_SRV_CHECK_STATUS:
+            case ST_F_AGG_SRV_CHECK_STATUS: // DEPRECATED
+            case ST_F_AGG_SRV_STATUS:
                 if (!px->srv)
                     goto next_px;
                 sv = px->srv;
diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index efa0ac383..253fb6c48 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -455,6 +455,7 @@ enum stat_field {
ST_F_USED_CONN_CUR,
ST_F_NEED_CONN_EST,
ST_F_UWEIGHT,
+   ST_F_AGG_SRV_STATUS,
ST_F_AGG_SRV_CHECK_STATUS,
 
/* must always be the last one */
diff --git a/src/stats.c b/src/stats.c
index 8ed5a7133..c15bc27ec 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -259,7 +259,8 @@ const struct name_desc stat_fields[ST_F_TOTAL_FIELDS] = {
 	[ST_F_USED_CONN_CUR] = { .name = "used_conn_cur",   .desc = "Current number of connections in use"},
 	[ST_F_NEED_CONN_EST] = { .name = "need_conn_est",   .desc = "Estimated needed number of connections"},
 	[ST_F_UWEIGHT]       = { .name = "uweight",         .desc = "Server's user weight, or sum of active servers' user weights for a backend" },
-	[ST_F_AGG_SRV_CHECK_STATUS]  = { .name = "agg_server_check_status", .desc = "Backend's aggregated gauge of servers' state check status" },
+	[ST_F_AGG_SRV_CHECK_STATUS]  = { .name = "agg_server_check_status", .desc = "[DEPRECATED] Backend's aggregated gauge of servers' status" },
+	[ST_F_AGG_SRV_STATUS ]       = { .name = "agg_server_status",       .desc = "Backend's aggregated gauge of servers' status" },
 };
 
 /* one line of info */
@@ -2673,7 +2674,8 @@ int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int le
                 chunk_appendf(out, " (%d/%d)", nbup, nbsrv);
metric = mkf_str(FO_STATUS, fld);
break;
-   case ST_F_AGG_SRV_CHECK_STATUS:
+   case ST_F_AGG_SRV_CHECK_STATUS:   // DEPRECATED
+   case ST_F_AGG_SRV_STATUS:
metric = mkf_u32(FN_GAUGE, 0);
break;
case ST_F_WEIGHT:
-- 
2.25.1



[PATCH 2/2] [PATCH] MINOR: promex: introduce haproxy_backend_agg_check_status

2022-12-08 Thread Cedric Paillet
This patch introduces the haproxy_backend_agg_check_status metric, as we wanted 
in 42d7c402d, but with the right data source.
---
 addons/promex/service-prometheus.c | 26 ++++++++++++++++++++++++++
 include/haproxy/stats-t.h          |  1 +
 src/stats.c                        |  4 ++++
 3 files changed, 31 insertions(+)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index 9330a860c..5622b8039 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -303,6 +303,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
     [ST_F_UWEIGHT]              = { .n = IST("uweight"),                 .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
     [ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
     [ST_F_AGG_SRV_STATUS ]      = { .n = IST("agg_server_status"),       .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+    [ST_F_AGG_CHECK_STATUS]     = { .n = IST("agg_check_status"),        .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -812,6 +813,7 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
     double secs;
     enum promex_back_state bkd_state;
     enum promex_srv_state srv_state;
+    enum healthcheck_status srv_check_status;
 
     for (;ctx->field_num < ST_F_TOTAL_FIELDS; ctx->field_num++) {
         if (!(promex_st_metrics[ctx->field_num].flags & ctx->flags))
@@ -820,6 +822,8 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
         while (ctx->px) {
             struct promex_label labels[PROMEX_MAX_LABELS-1] = {};
             unsigned int srv_state_count[PROMEX_SRV_STATE_COUNT] = { 0 };
+            unsigned int srv_check_count[HCHK_STATUS_SIZE] = { 0 };
+            const char *check_state;
 
             px = ctx->px;
 
@@ -854,6 +858,28 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                 }
                 ctx->obj_state = 0;
                 goto next_px;
+            case ST_F_AGG_CHECK_STATUS:
+                if (!px->srv)
+                    goto next_px;
+                sv = px->srv;
+                while (sv) {
+                    srv_check_status = sv->check.status;
+                    srv_check_count[srv_check_status] += 1;
+                    sv = sv->next;
+                }
+                for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
+                    if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
+                        continue;
+                    val = mkf_u32(FO_STATUS, srv_check_count[ctx->obj_state]);
+                    check_state = get_check_status_info(ctx->obj_state);
+                    labels[1].name = ist("state");
+                    labels[1].value = ist(check_state);
+                    if (!promex_dump_metric(appctx, htx, prefix, _st_metrics[ctx->field_num],
+                                , labels, , max))
+                        goto full;
+                }
+                ctx->obj_state = 0;
+                goto next_px;
             case ST_F_STATUS:
                 bkd_state = ((px->lbprm.tot_weight > 0 || !px->srv) ? 1 : 0);
                 for (; ctx->obj_state < PROMEX_BACK_STATE_COUNT; ctx->obj_state++) {
diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index 253fb6c48..c60bf0487 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -457,6 +457,7 @@ enum stat_field {
ST_F_UWEIGHT,
ST_F_AGG_SRV_STATUS,
ST_F_AGG_SRV_CHECK_STATUS,
+   ST_F_AGG_CHECK_STATUS,
 
/* must always be the last one */
ST_F_TOTAL_FIELDS
diff --git 

[PATCH 0/2] BUG/MINOR: promex: create haproxy_backend_agg_check_status

2022-12-08 Thread Cedric Paillet
As described in github issue #1312, the first intention of patch 42d7c402d
was to aggregate haproxy_server_check_status. But we aggregated
haproxy_server_status instead.

To fix that:
- Deprecate haproxy_backend_agg_server_check_status (update the metric 
description).
- Create a new haproxy_backend_agg_server_status metric, an aggregation of 
haproxy_server_status (to replace the misnamed 
haproxy_backend_agg_server_check_status).
- Create a new haproxy_backend_agg_check_status metric, an aggregation of 
haproxy_server_check_status.

Cedric Paillet (2):
  BUG/MINOR: promex: create haproxy_backend_agg_server_status
  MINOR: promex: introduce haproxy_backend_agg_check_status

 addons/promex/service-prometheus.c | 30 +++++++++++++++++++++++++++++-
 include/haproxy/stats-t.h          |  2 ++
 src/stats.c                        | 10 ++++++++--
 3 files changed, 39 insertions(+), 3 deletions(-)

-- 
2.25.1



RE: RE: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status

2022-12-06 Thread Cedric Paillet

>Sorry, indeed I mentioned the enum ID and not the promex name.
>I proposed to keep "haproxy_backend_agg_server_check_status", altering its
>description to announce it is deprecated. And then to add
>"haproxy_backend_agg_check_status" (to aggregate haproxy_server_check_status) and
>"haproxy_backend_agg_server_status"
> or "haproxy_backend_agg_srv_status" (to aggregate haproxy_server_status).

To summarize, is this OK for you?

- Deprecate haproxy_backend_agg_server_check_status (update the metric 
description).
- Create a new haproxy_backend_agg_server_status metric, an aggregation of 
haproxy_server_status (to replace the misnamed 
haproxy_backend_agg_server_check_status).
- Create a new haproxy_backend_agg_check_status metric, an aggregation of 
haproxy_server_check_status.

This will be merged/backported into 2.5/2.6/2.7/2.8-dev, and 
haproxy_backend_agg_server_check_status will be removed in 2.9 (or 2.10).


RE: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status

2022-12-06 Thread Cedric Paillet
> Sorry for the delay. We were busy because of the 2.7.0 and then I forgot your 
> patches :) And it was a bit late to add it into the 2.7.0.
No problem.

>  Because the metric name is indeed badly chosen but we can't break the 
> compatibility, especially in a LTS version.
Ok, I understand that. We need to find new names for the metric

>IMHO, we should keep the current metric and its description should be updated 
>to mention it is deprecated. This way, we will be able to remove it in the 2.9 
>or 3.0. This means we should find new names for the aggregated servers status 
>and the aggregated check status. "ST_F_AGG_SRV_STATUS" is pretty good for the 
>first one. I may propose "ST_F_AGG_CHECK_STATUS" for the second one.

It's not entirely clear to me: the thing to change to keep compatibility is the 
metric name itself?

Do you propose to use:
- haproxy_backend_agg_check_status (to aggregate haproxy_server_check_status)
- haproxy_backend_agg_status (to aggregate haproxy_server_status)
or
- haproxy_backend_agg_srv_check_status (to aggregate haproxy_server_check_status)
- haproxy_backend_agg_srv_status (to aggregate haproxy_server_status)

As a reminder, we currently have:
- haproxy_backend_agg_server_check_status (which aggregates haproxy_server_status)

(and we also have haproxy_backend_status, for the status of the backend 
itself)

Cedric Paillet


RE: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status

2022-11-29 Thread Cedric Paillet
Hi William,

For me it's not a problem to backport down to 2.4.
Even if it's a breaking change for haproxy_backend_agg_server_check_status, 
it's clearer and makes more sense if it aggregates server_check_status and not 
server_status.

Cedric.



-----Original Message-----
From: William Dauchy
Sent: Tuesday, November 29, 2022 12:39
To: Cedric Paillet
Cc: haproxy@formilux.org; Christopher Faulet
Subject: Re: [PATCH 0/2] BUG/MINOR: promex: fix 
haproxy_backend_agg_server_check_status

On Tue, Nov 29, 2022 at 11:52 AM William Dauchy  wrote:
> On Mon, Nov 28, 2022 at 8:46 AM Cedric Paillet  wrote:
> > As described in github issue #1312, the first intention of patch 
> > 42d7c402d was to aggregate haproxy_server_check_status. But we 
> > aggregated haproxy_server_status instead.
> > To fix that, rename haproxy_backend_agg_server_check_status to 
> > haproxy_backend_agg_server_status and use right data source for 
> > haproxy_backend_agg_server_check_status.
>
> Thank you for looking into this. This has been on my todo list for too long.
> I am fine with those changes as long as we backport them until the 
> version where this was introduced (this was introduced in v2.5 and 
> then backported in v2.4 if my checks are correct).
> I understand that's a breaking change, some people started to build 
> tooling on top of that can't deal with behaviour change between 
> versions (I am thinking about dashboards in my case where we deal with 
> different versions). In that regard, I believe it can be ok to have to 
> do a one off change to fix this past mistake. If we are not aligned 
> with a backport, we will need to find another name.

I realised I was not necessarily very clear on my position:
- either we merge this patch and backport it down to 2.4,
- or we find a new name for the new metric and don't introduce a breaking 
change.

I am mostly worried about metrics collectors who are not able to deal with 
different behaviour from one version to another. In my own case, we have 
different haproxy versions running, and there is no global way we can adapt the 
collector depending on the software version it is collecting. The collector 
expects the same behaviour across versions.
Having a breaking change which affects all versions is fine as it will require 
a one-off change on our side, but having a situation in the middle is not 
acceptable for us.

--
William


[PATCH 1/2] [PATCH] BUG/MINOR: promex: rename haproxy_backend_agg_server_check_status

2022-11-27 Thread Cedric Paillet
haproxy_backend_agg_server_check_status currently aggregates
haproxy_server_status instead of haproxy_server_check_status.
Rename the metric to haproxy_backend_agg_server_status to
clarify what it really does.
---
 addons/promex/service-prometheus.c | 4 ++--
 include/haproxy/stats-t.h  | 2 +-
 src/stats.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index d27aefaaa..c24f0895c 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -301,7 +301,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
     [ST_F_USED_CONN_CUR]        = { .n = IST("used_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
     [ST_F_NEED_CONN_EST]        = { .n = IST("need_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
     [ST_F_UWEIGHT]              = { .n = IST("uweight"),                  .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
-    [ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"),  .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+    [ST_F_AGG_SRV_STATUS ]      = { .n = IST("agg_server_status"),        .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -833,7 +833,7 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                 return -1;
 
             switch (ctx->field_num) {
-            case ST_F_AGG_SRV_CHECK_STATUS:
+            case ST_F_AGG_SRV_STATUS :
                 if (!px->srv)
                     goto next_px;
                 sv = px->srv;
diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index efa0ac383..babea31e1 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -455,7 +455,7 @@ enum stat_field {
ST_F_USED_CONN_CUR,
ST_F_NEED_CONN_EST,
ST_F_UWEIGHT,
-   ST_F_AGG_SRV_CHECK_STATUS,
+   ST_F_AGG_SRV_STATUS ,
 
/* must always be the last one */
ST_F_TOTAL_FIELDS
diff --git a/src/stats.c b/src/stats.c
index 97e7fd865..dd3044b58 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -259,7 +259,7 @@ const struct name_desc stat_fields[ST_F_TOTAL_FIELDS] = {
 	[ST_F_USED_CONN_CUR] = { .name = "used_conn_cur",   .desc = "Current number of connections in use"},
 	[ST_F_NEED_CONN_EST] = { .name = "need_conn_est",   .desc = "Estimated needed number of connections"},
 	[ST_F_UWEIGHT]       = { .name = "uweight",         .desc = "Server's user weight, or sum of active servers' user weights for a backend" },
-	[ST_F_AGG_SRV_CHECK_STATUS]  = { .name = "agg_server_check_status", .desc = "Backend's aggregated gauge of servers' state check status" },
+	[ST_F_AGG_SRV_STATUS ]       = { .name = "agg_server_status",       .desc = "Backend's aggregated gauge of servers' status" },
 };
 
 /* one line of info */
@@ -2673,7 +2673,7 @@ int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int le
                 chunk_appendf(out, " (%d/%d)", nbup, nbsrv);
metric = mkf_str(FO_STATUS, fld);
break;
-   case ST_F_AGG_SRV_CHECK_STATUS:
+   case ST_F_AGG_SRV_STATUS :
metric = mkf_u32(FN_GAUGE, 0);
break;
case ST_F_WEIGHT:
-- 
2.25.1



[PATCH 2/2] [PATCH] MINOR: promex: reintroduce haproxy_backend_agg_server_check_status

2022-11-27 Thread Cedric Paillet
This patch reintroduces the haproxy_backend_agg_server_check_status metric
as in 42d7c402d, but with the right data source.
---
 addons/promex/service-prometheus.c | 26 ++++++++++++++++++++++++++
 include/haproxy/stats-t.h          |  1 +
 src/stats.c                        |  4 ++++
 3 files changed, 31 insertions(+)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index c24f0895c..265ba97c4 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -302,6 +302,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
     [ST_F_NEED_CONN_EST]        = { .n = IST("need_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
     [ST_F_UWEIGHT]              = { .n = IST("uweight"),                  .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
     [ST_F_AGG_SRV_STATUS ]      = { .n = IST("agg_server_status"),        .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+    [ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"),  .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -811,6 +812,7 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
     double secs;
     enum promex_back_state bkd_state;
     enum promex_srv_state srv_state;
+    enum healthcheck_status srv_check_status;
 
     for (;ctx->field_num < ST_F_TOTAL_FIELDS; ctx->field_num++) {
         if (!(promex_st_metrics[ctx->field_num].flags & ctx->flags))
@@ -819,6 +821,8 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
         while (ctx->px) {
             struct promex_label labels[PROMEX_MAX_LABELS-1] = {};
             unsigned int srv_state_count[PROMEX_SRV_STATE_COUNT] = { 0 };
+            unsigned int srv_check_count[HCHK_STATUS_SIZE] = { 0 };
+            const char *check_state;
 
             px = ctx->px;
 
@@ -852,6 +856,28 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
                 }
                 ctx->obj_state = 0;
                 goto next_px;
+            case ST_F_AGG_SRV_CHECK_STATUS:
+                if (!px->srv)
+                    goto next_px;
+                sv = px->srv;
+                while (sv) {
+                    srv_check_status = sv->check.status;
+                    srv_check_count[srv_check_status] += 1;
+                    sv = sv->next;
+                }
+                for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
+                    if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
+                        continue;
+                    val = mkf_u32(FO_STATUS, srv_check_count[ctx->obj_state]);
+                    check_state = get_check_status_info(ctx->obj_state);
+                    labels[1].name = ist("state");
+                    labels[1].value = ist(check_state);
+                    if (!promex_dump_metric(appctx, htx, prefix, _st_metrics[ctx->field_num],
+                                , labels, , max))
+                        goto full;
+                }
+                ctx->obj_state = 0;
+                goto next_px;
             case ST_F_STATUS:
                 bkd_state = ((px->lbprm.tot_weight > 0 || !px->srv) ? 1 : 0);
                 for (; ctx->obj_state < PROMEX_BACK_STATE_COUNT; ctx->obj_state++) {
diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index babea31e1..66bacd065 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -456,6 +456,7 @@ enum stat_field {
ST_F_NEED_CONN_EST,
ST_F_UWEIGHT,
ST_F_AGG_SRV_STATUS ,
+   ST_F_AGG_SRV_CHECK_STATUS,
 
/* must always be the last one */
ST_F_TOTAL_FIELDS
diff --git 

[PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status

2022-11-27 Thread Cedric Paillet
As described in github issue #1312, the first intention of patch 42d7c402d
was to aggregate haproxy_server_check_status. But we aggregated
haproxy_server_status instead.
To fix that, rename haproxy_backend_agg_server_check_status to
haproxy_backend_agg_server_status and use the right data source for
haproxy_backend_agg_server_check_status.

Cedric Paillet (2):
  BUG/MINOR: promex: rename haproxy_backend_agg_server_check_status
  MINOR: promex: reintroduce haproxy_backend_agg_server_check_status

 addons/promex/service-prometheus.c | 28 +++-
 include/haproxy/stats-t.h  |  1 +
 src/stats.c|  4 
 3 files changed, 32 insertions(+), 1 deletion(-)

-- 
2.25.1