Hi Willy,

On Fri, May 15, 2020 at 2:45 PM Willy Tarreau <[email protected]> wrote:
> What I'd like to see for the ctime etc is:
> - average over the number of samples like you did, but still per request;
> - possibly new sets of per-connection metrics if these have some values
>   to you (I guess they do).

I think the metrics we already have are sufficient: in HTTP mode we measure on a per-request basis, and in TCP mode on a per-connection basis (I hope cum_lbconn is the right counter for TCP).

> What will happen with two sets of metrics is that both will be exactly
> identical in TCP. But that's precisely because HTTP makes a difference that
> we want to split them!

I don't want to change this, so I kept the existing metrics (ctime, ttime, etc.), but as mentioned above they are calculated slightly differently in HTTP and TCP modes.

> > I don't mind creating additional metrics if they make sense of course. One
> > of them (which could help to avoid exceptions) would be req_tot for a
> > server (it's available for backend).
>
> Wow, I thought we already had it. When I say that we really need to invest
> time digging into the existing metrics, I'm not kidding. Yes sure, feel free
> to add this one as well.

Right now it feels like a hack, as it is obtained by summing up all the hrsp_* metrics. Patch 0002 addresses that.

> > To dynamically scale n, req_tot is
> > much more suitable than lbconn for http mode (lbconn is incremented much
> > earlier than req_tot, creating a time window when metrics can be incorrectly
> > calculated, leading to wrong averages).
>
> It's not the worst issue, the worst issue is that lbconn is only incremented
> on load balanced connections; as soon as you have cookie-based stickiness or
> use-server rules, lbconn doesn't move.
>
> I've just looked on the stats page and I'm seeing a "Cum HTTP responses"
> field for the servers in the tooltip, which is accurately incremented. I
> just don't know what counter it is, I would have thought it's the request
> counter.

See above.
> > Of course if you still think it's a
> > good idea to separate them (personally I'm not convinced about this unless
> > you don't want to change the behaviour of an existing metric etc.) I can
> > make new metrics, eg. ctime_avg.
>
> Just to be sure we're talking about the same thing, I'm convinced about the
> benefit of splitting such averages per-request and per-connection, and of
> changing them to use your sliding average.

I kept things as-is, so HTTP shows the *time metrics per request and TCP per connection.

> I'm pretty sure I tried the two and found that rounding up was better.
> I think it's related to the loss of precision caused by exponentiating
> the old value, which would disappear after the second sample if only
> rounded. But I could be wrong, that's a bit old now. I'd suggest to
> proceed as you find the most accurate between the two.

I left rounding up in place.

> > > I'd also ask you to split your patch in 2 :
> > >
> > > - one which improves the freq_ctr with the new function(s)
> > > - one which adds the new metric, its update and reporting to the
> > >   various call places.
> > >
> > > In addition, I think there will be less changes with a separate metric.
> >
> > Will do. Thank you for the review.
>
> You're welcome, thanks for this!

I actually split this into 3 patches (see attachments):

1. Add the improved swrate_add_dynamic function.
2. Add a proper req_tot metric for a server (as we already have for
   frontend/backend).
3. Update the code to use the new function for the existing *time metrics.
   They should behave the same, except for being much more accurate at the
   very beginning. This one depends on 1 & 2.

Regards,

Marcin Deranek
Senior System Administrator
Booking.com B.V.
Singel 540, Amsterdam 1017AZ, Netherlands
Direct +31207243362
https://www.booking.com/
From af5e10eeecd9c146861ff7bd8bcaec041eb77d5d Mon Sep 17 00:00:00 2001
From: Marcin Deranek <[email protected]>
Date: Fri, 15 May 2020 18:26:18 +0200
Subject: [PATCH 1/3] MINOR: stats: Prepare for more accurate moving averages

Add the swrate_add_dynamic function, which is similar to swrate_add, but
more accurate when calculating moving averages when not enough samples
have been processed yet.
---
 include/proto/freq_ctr.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/proto/freq_ctr.h b/include/proto/freq_ctr.h
index ff13ea8f4..63febb24a 100644
--- a/include/proto/freq_ctr.h
+++ b/include/proto/freq_ctr.h
@@ -263,6 +263,24 @@ static inline unsigned int swrate_add(unsigned int *sum, unsigned int n, unsigne
 	return new_sum;
 }
 
+/* Adds sample value <v> to sliding window sum <sum> configured for <n> samples.
+ * The sample is returned. Better if <n> is a power of two. This function is
+ * thread-safe.
+ * This function should give better accuracy than swrate_add when the number
+ * of samples collected is lower than the nominal window size. In such
+ * circumstances <n> should be set to 0.
+ */
+static inline unsigned int swrate_add_dynamic(unsigned int *sum, unsigned int n, unsigned int v)
+{
+	unsigned int new_sum, old_sum;
+
+	old_sum = *sum;
+	do {
+		new_sum = old_sum - (n ? (old_sum + n - 1) / n : 0) + v;
+	} while (!_HA_ATOMIC_CAS(sum, &old_sum, new_sum));
+	return new_sum;
+}
+
 /* Adds sample value <v> spanning <s> samples to sliding window sum <sum>
  * configured for <n> samples, where <n> is supposed to be "much larger" than
  * <s>. The sample is returned. Better if <n> is a power of two. Note that this
-- 
2.26.2
From bf51dd11b09aab65e29cffceefe51fbd53ec479c Mon Sep 17 00:00:00 2001
From: Marcin Deranek <[email protected]>
Date: Fri, 15 May 2020 18:32:51 +0200
Subject: [PATCH 2/3] MINOR: stats: Expose native cum_req metric for a server

Expose the native cum_req metric for a server: so far it was calculated
as a sum of all responses. Rename it from "Cum. HTTP Responses" to
"Cum. HTTP Requests" to be consistent with Frontend and Backend.
---
 .../prometheus-exporter/service-prometheus.c |  5 +++
 src/http_ana.c                               |  4 ++-
 src/stats.c                                  | 32 +++++++++----------
 3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/contrib/prometheus-exporter/service-prometheus.c b/contrib/prometheus-exporter/service-prometheus.c
index 54770ef5b..f7542d26b 100644
--- a/contrib/prometheus-exporter/service-prometheus.c
+++ b/contrib/prometheus-exporter/service-prometheus.c
@@ -2061,6 +2061,11 @@ static int promex_dump_srv_metrics(struct appctx *appctx, struct htx *htx)
 				case ST_F_LBTOT:
 					metric = mkf_u64(FN_COUNTER, sv->counters.cum_lbconn);
 					break;
+				case ST_F_REQ_TOT:
+					if (px->mode != PR_MODE_HTTP)
+						goto next_px;
+					metric = mkf_u64(FN_COUNTER, sv->counters.p.http.cum_req);
+					break;
 				case ST_F_HRSP_1XX:
 					if (px->mode != PR_MODE_HTTP)
 						goto next_px;
diff --git a/src/http_ana.c b/src/http_ana.c
index f6e0eca3b..5492b54fe 100644
--- a/src/http_ana.c
+++ b/src/http_ana.c
@@ -1721,8 +1721,10 @@ int http_wait_for_response(struct stream *s, struct channel *rep, int an_bit)
 		if (n == 4)
 			stream_inc_http_err_ctr(s);
 
-		if (objt_server(s->target))
+		if (objt_server(s->target)) {
 			_HA_ATOMIC_ADD(&__objt_server(s->target)->counters.p.http.rsp[n], 1);
+			_HA_ATOMIC_ADD(&__objt_server(s->target)->counters.p.http.cum_req, 1);
+		}
 
 		/* Adjust server's health based on status code. Note: status codes 501
 		 * and 505 are triggered on demand by client request, so we must not
diff --git a/src/stats.c b/src/stats.c
index f76fd3790..754819734 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -1016,19 +1016,10 @@ static int stats_dump_fields_html(struct buffer *out,
 
 		/* http response (via hover): 1xx, 2xx, 3xx, 4xx, 5xx, other */
 		if (strcmp(field_str(stats, ST_F_MODE), "http") == 0) {
-			unsigned long long tot;
-
-			tot  = stats[ST_F_HRSP_OTHER].u.u64;
-			tot += stats[ST_F_HRSP_1XX].u.u64;
-			tot += stats[ST_F_HRSP_2XX].u.u64;
-			tot += stats[ST_F_HRSP_3XX].u.u64;
-			tot += stats[ST_F_HRSP_4XX].u.u64;
-			tot += stats[ST_F_HRSP_5XX].u.u64;
-
 			chunk_appendf(out,
 			              "<tr><th>New connections:</th><td>%s</td></tr>"
 			              "<tr><th>Reused connections:</th><td>%s</td><td>(%d%%)</td></tr>"
-			              "<tr><th>Cum. HTTP responses:</th><td>%s</td></tr>"
+			              "<tr><th>Cum. HTTP requests:</th><td>%s</td></tr>"
 			              "<tr><th>- HTTP 1xx responses:</th><td>%s</td><td>(%d%%)</td></tr>"
 			              "<tr><th>- HTTP 2xx responses:</th><td>%s</td><td>(%d%%)</td></tr>"
 			              "<tr><th>- HTTP 3xx responses:</th><td>%s</td><td>(%d%%)</td></tr>"
@@ -1042,13 +1033,19 @@ static int stats_dump_fields_html(struct buffer *out,
 			              U2H(stats[ST_F_REUSE].u.u64),
 			              (stats[ST_F_CONNECT].u.u64 + stats[ST_F_REUSE].u.u64) ?
 			              (int)(100 * stats[ST_F_REUSE].u.u64 / (stats[ST_F_CONNECT].u.u64 + stats[ST_F_REUSE].u.u64)) : 0,
-			              U2H(tot),
-			              U2H(stats[ST_F_HRSP_1XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_1XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_2XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_2XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_3XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_3XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_4XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_4XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_5XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_5XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_OTHER].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_OTHER].u.u64 / tot) : 0,
+			              U2H(stats[ST_F_REQ_TOT].u.u64),
+			              U2H(stats[ST_F_HRSP_1XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_1XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_2XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_2XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_3XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_3XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_4XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_4XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_5XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_5XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_OTHER].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_OTHER].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
 			              U2H(stats[ST_F_WREW].u.u64),
 			              U2H(stats[ST_F_EINT].u.u64));
 		}
@@ -1796,6 +1793,7 @@ int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
 
 	/* http response: 1xx, 2xx, 3xx, 4xx, 5xx, other */
 	if (px->mode == PR_MODE_HTTP) {
+		stats[ST_F_REQ_TOT] = mkf_u64(FN_COUNTER, sv->counters.p.http.cum_req);
 		stats[ST_F_HRSP_1XX] = mkf_u64(FN_COUNTER, sv->counters.p.http.rsp[1]);
 		stats[ST_F_HRSP_2XX] = mkf_u64(FN_COUNTER, sv->counters.p.http.rsp[2]);
 		stats[ST_F_HRSP_3XX] = mkf_u64(FN_COUNTER, sv->counters.p.http.rsp[3]);
-- 
2.26.2
From fad692949f902fb4f3240ee9eba30b9a64a171ba Mon Sep 17 00:00:00 2001
From: Marcin Deranek <[email protected]>
Date: Fri, 15 May 2020 20:02:40 +0200
Subject: [PATCH 3/3] MEDIUM: stats: Enable more accurate moving average calculation for stats

Enable more accurate generation of moving averages for a partially
populated samples window.
---
 src/stats.c  | 29 +++++++++++++++++++++--------
 src/stream.c | 21 +++++++++++++--------
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/stats.c b/src/stats.c
index 754819734..1413d3dcb 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -1610,6 +1610,8 @@ int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
 	struct buffer *out = get_trash_chunk();
 	enum srv_stats_state state;
 	char *fld_status;
+	long long srv_samples_counter;
+	unsigned int srv_samples_window = TIME_STATS_SAMPLES;
 
 	if (len < ST_F_TOTAL_FIELDS)
 		return 0;
@@ -1809,10 +1811,14 @@ int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
 	stats[ST_F_SRV_ABRT] = mkf_u64(FN_COUNTER, sv->counters.srv_aborts);
 	stats[ST_F_LASTSESS] = mkf_s32(FN_AGE, srv_lastsession(sv));
 
-	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.q_time, TIME_STATS_SAMPLES));
-	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.c_time, TIME_STATS_SAMPLES));
-	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, TIME_STATS_SAMPLES));
-	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.t_time, TIME_STATS_SAMPLES));
+	srv_samples_counter = (px->mode == PR_MODE_HTTP) ? sv->counters.p.http.cum_req : sv->counters.cum_lbconn;
+	if (srv_samples_counter < TIME_STATS_SAMPLES && srv_samples_counter > 0)
+		srv_samples_window = srv_samples_counter;
+
+	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.q_time, srv_samples_window));
+	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.c_time, srv_samples_window));
+	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, srv_samples_window));
+	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.t_time, srv_samples_window));
 
 	stats[ST_F_QT_MAX] = mkf_u32(FN_MAX, sv->counters.qtime_max);
 	stats[ST_F_CT_MAX] = mkf_u32(FN_MAX, sv->counters.ctime_max);
@@ -1870,6 +1876,9 @@ static int stats_dump_sv_stats(struct stream_interface *si, struct proxy *px, st
  */
 int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int len)
 {
+	long long be_samples_counter;
+	unsigned int be_samples_window = TIME_STATS_SAMPLES;
+
 	if (len < ST_F_TOTAL_FIELDS)
 		return 0;
@@ -1942,10 +1951,14 @@ int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int le
 	stats[ST_F_COMP_RSP] = mkf_u64(FN_COUNTER, px->be_counters.p.http.comp_rsp);
 	stats[ST_F_LASTSESS] = mkf_s32(FN_AGE, be_lastsession(px));
 
-	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.q_time, TIME_STATS_SAMPLES));
-	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.c_time, TIME_STATS_SAMPLES));
-	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.d_time, TIME_STATS_SAMPLES));
-	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.t_time, TIME_STATS_SAMPLES));
+	be_samples_counter = (px->mode == PR_MODE_HTTP) ? px->be_counters.p.http.cum_req : px->be_counters.cum_lbconn;
+	if (be_samples_counter < TIME_STATS_SAMPLES && be_samples_counter > 0)
+		be_samples_window = be_samples_counter;
+
+	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.q_time, be_samples_window));
+	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.c_time, be_samples_window));
+	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.d_time, be_samples_window));
+	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.t_time, be_samples_window));
 
 	stats[ST_F_QT_MAX] = mkf_u32(FN_MAX, px->be_counters.qtime_max);
 	stats[ST_F_CT_MAX] = mkf_u32(FN_MAX, px->be_counters.ctime_max);
diff --git a/src/stream.c b/src/stream.c
index 4fa41a71d..1b8a86870 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -2441,6 +2441,7 @@ void stream_update_time_stats(struct stream *s)
 	int t_data;
 	int t_close;
 	struct server *srv;
+	unsigned int samples_window;
 
 	t_request = 0;
 	t_queue   = s->logs.t_queue;
@@ -2463,19 +2464,23 @@ void stream_update_time_stats(struct stream *s)
 	srv = objt_server(s->target);
 	if (srv) {
-		swrate_add(&srv->counters.q_time, TIME_STATS_SAMPLES, t_queue);
-		swrate_add(&srv->counters.c_time, TIME_STATS_SAMPLES, t_connect);
-		swrate_add(&srv->counters.d_time, TIME_STATS_SAMPLES, t_data);
-		swrate_add(&srv->counters.t_time, TIME_STATS_SAMPLES, t_close);
+		samples_window = (((s->be->mode == PR_MODE_HTTP) ?
+			srv->counters.p.http.cum_req : srv->counters.cum_lbconn) > TIME_STATS_SAMPLES) ? TIME_STATS_SAMPLES : 0;
+		swrate_add_dynamic(&srv->counters.q_time, samples_window, t_queue);
+		swrate_add_dynamic(&srv->counters.c_time, samples_window, t_connect);
+		swrate_add_dynamic(&srv->counters.d_time, samples_window, t_data);
+		swrate_add_dynamic(&srv->counters.t_time, samples_window, t_close);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.qtime_max, t_queue);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.ctime_max, t_connect);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.dtime_max, t_data);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.ttime_max, t_close);
 	}
-	swrate_add(&s->be->be_counters.q_time, TIME_STATS_SAMPLES, t_queue);
-	swrate_add(&s->be->be_counters.c_time, TIME_STATS_SAMPLES, t_connect);
-	swrate_add(&s->be->be_counters.d_time, TIME_STATS_SAMPLES, t_data);
-	swrate_add(&s->be->be_counters.t_time, TIME_STATS_SAMPLES, t_close);
+	samples_window = (((s->be->mode == PR_MODE_HTTP) ?
+		s->be->be_counters.p.http.cum_req : s->be->be_counters.cum_lbconn) > TIME_STATS_SAMPLES) ? TIME_STATS_SAMPLES : 0;
+	swrate_add_dynamic(&s->be->be_counters.q_time, samples_window, t_queue);
+	swrate_add_dynamic(&s->be->be_counters.c_time, samples_window, t_connect);
+	swrate_add_dynamic(&s->be->be_counters.d_time, samples_window, t_data);
+	swrate_add_dynamic(&s->be->be_counters.t_time, samples_window, t_close);
 	HA_ATOMIC_UPDATE_MAX(&s->be->be_counters.qtime_max, t_queue);
 	HA_ATOMIC_UPDATE_MAX(&s->be->be_counters.ctime_max, t_connect);
 	HA_ATOMIC_UPDATE_MAX(&s->be->be_counters.dtime_max, t_data);
-- 
2.26.2

