Hi Willy,

On Fri, May 15, 2020 at 2:45 PM Willy Tarreau <[email protected]> wrote:
> What I'd like to see for the ctime etc is:
> - average over the number of samples like you did, but still per request;
> - possibly new sets of per-connection metrics if these have some values
>   to you (I guess they do).

I think the metrics we already have are sufficient: in HTTP mode we measure on a per-request basis, and in TCP mode on a per-connection basis (I hope cum_lbconn is the right counter for TCP).

> What will happen with two sets of metrics is that both will be exactly
> identical in TCP. But that's precisely because HTTP makes a difference that
> we want to split them!

I don't want to change this, so I kept the existing metrics (ctime, ttime, etc.), but as mentioned above they are calculated slightly differently in HTTP and TCP modes.

> > I don't mind creating additional metrics if they make sense of course. One
> > of them (which could help to avoid exceptions) would be req_tot for a
> > server (it's available for backend).
>
> Wow, I thought we already had it. When I say that we really need to invest
> time digging into the existing metrics, I'm not kidding. Yes sure, feel free
> to add this one as well.

Right now it feels like a hack, as it is obtained by summing up all the hrsp_* metrics. Patch 0002 addresses that.

> > To dynamically scale n, req_tot is
> > much more suitable than lbconn for http mode (lbconn is incremented much
> > earlier than req_tot, creating a time window when metrics can be incorrectly
> > calculated, leading to wrong averages).
>
> It's not the worst issue, the worst issue is that lbconn is only incremented
> on load balanced connections; as soon as you have cookie-based stickiness or
> use-server rules, lbconn doesn't move.
>
> I've just looked on the stats page and I'm seeing a "Cum HTTP responses"
> field for the servers in the tooltip, which is accurately incremented. I
> just don't know what counter it is, I would have thought it's the request
> counter.

See above.
> > Of course if you still think it's a
> > good idea to separate them (personally I'm not convinced about this unless
> > you don't want to change the behaviour of an existing metric etc.) I can
> > make new metrics, eg. ctime_avg.
>
> Just to be sure we're talking about the same thing, I'm convinced about the
> benefit of splitting such averages per-request and per-connection, and of
> changing them to use your sliding average.

I kept things as-is, so HTTP shows the *time metrics per request and TCP per connection.

> I'm pretty sure I tried the two and found that rounding up was better.
> I think it's related to the loss of precision caused by exponentiating
> the old value, which would disappear after the second sample if only
> rounded. But I could be wrong, that's a bit old now. I'd suggest to
> proceed as you find the most accurate between the two.

I left rounding up in place.

> > > I'd also ask you to split your patch in 2 :
> > >
> > > - one which improves the freq_ctr with the new function(s)
> > > - one which adds the new metric, its update and reporting to the
> > >   various call places.
> > >
> > > In addition, I think there will be less changes with a separate metric.
> >
> > Will do. Thank you for the review.
>
> You're welcome, thanks for this!

I actually split this into 3 patches (see attachments):

1. Add the improved swrate_add_dynamic function.
2. Add a proper req_tot metric for a server (as we already have for
   frontend/backend).
3. Update the code to use the new function for the existing *time metrics.
   They should behave the same, except for being much more accurate at the
   very beginning. This one depends on 1 & 2.

Regards,

Marcin Deranek
Senior System Administrator
Booking.com B.V.
Singel 540, Amsterdam 1017AZ, Netherlands
Direct +31207243362
https://www.booking.com/
From af5e10eeecd9c146861ff7bd8bcaec041eb77d5d Mon Sep 17 00:00:00 2001
From: Marcin Deranek <[email protected]>
Date: Fri, 15 May 2020 18:26:18 +0200
Subject: [PATCH 1/3] MINOR: stats: Prepare for more accurate moving averages

Add the swrate_add_dynamic function, which is similar to swrate_add, but
more accurate when calculating moving averages when not enough samples
have been processed yet.
---
 include/proto/freq_ctr.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/proto/freq_ctr.h b/include/proto/freq_ctr.h
index ff13ea8f4..63febb24a 100644
--- a/include/proto/freq_ctr.h
+++ b/include/proto/freq_ctr.h
@@ -263,6 +263,24 @@ static inline unsigned int swrate_add(unsigned int *sum, unsigned int n, unsigne
 	return new_sum;
 }
 
+/* Adds sample value <v> to sliding window sum <sum> configured for <n> samples.
+ * The sample is returned. Better if <n> is a power of two. This function is
+ * thread-safe.
+ * This function should give better accuracy than swrate_add when the number
+ * of samples collected is lower than the nominal window size. In such
+ * circumstances <n> should be set to 0.
+ */
+static inline unsigned int swrate_add_dynamic(unsigned int *sum, unsigned int n, unsigned int v)
+{
+	unsigned int new_sum, old_sum;
+
+	old_sum = *sum;
+	do {
+		new_sum = old_sum - (n ? (old_sum + n - 1) / n : 0) + v;
+	} while (!_HA_ATOMIC_CAS(sum, &old_sum, new_sum));
+	return new_sum;
+}
+
 /* Adds sample value <v> spanning <s> samples to sliding window sum <sum>
  * configured for <n> samples, where <n> is supposed to be "much larger" than
  * <s>. The sample is returned. Better if <n> is a power of two. Note that this
-- 
2.26.2
From bf51dd11b09aab65e29cffceefe51fbd53ec479c Mon Sep 17 00:00:00 2001
From: Marcin Deranek <[email protected]>
Date: Fri, 15 May 2020 18:32:51 +0200
Subject: [PATCH 2/3] MINOR: stats: Expose native cum_req metric for a server

Expose the native cum_req metric for a server: so far it was calculated
as a sum of all responses. Rename it from "Cum. HTTP Responses" to
"Cum. HTTP Requests" to be consistent with Frontend and Backend.
---
 .../prometheus-exporter/service-prometheus.c |  5 +++
 src/http_ana.c                               |  4 ++-
 src/stats.c                                  | 32 +++++++++----------
 3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/contrib/prometheus-exporter/service-prometheus.c b/contrib/prometheus-exporter/service-prometheus.c
index 54770ef5b..f7542d26b 100644
--- a/contrib/prometheus-exporter/service-prometheus.c
+++ b/contrib/prometheus-exporter/service-prometheus.c
@@ -2061,6 +2061,11 @@ static int promex_dump_srv_metrics(struct appctx *appctx, struct htx *htx)
 				case ST_F_LBTOT:
 					metric = mkf_u64(FN_COUNTER, sv->counters.cum_lbconn);
 					break;
+				case ST_F_REQ_TOT:
+					if (px->mode != PR_MODE_HTTP)
+						goto next_px;
+					metric = mkf_u64(FN_COUNTER, sv->counters.p.http.cum_req);
+					break;
 				case ST_F_HRSP_1XX:
 					if (px->mode != PR_MODE_HTTP)
 						goto next_px;
diff --git a/src/http_ana.c b/src/http_ana.c
index f6e0eca3b..5492b54fe 100644
--- a/src/http_ana.c
+++ b/src/http_ana.c
@@ -1721,8 +1721,10 @@ int http_wait_for_response(struct stream *s, struct channel *rep, int an_bit)
 		if (n == 4)
 			stream_inc_http_err_ctr(s);
 
-		if (objt_server(s->target))
+		if (objt_server(s->target)) {
 			_HA_ATOMIC_ADD(&__objt_server(s->target)->counters.p.http.rsp[n], 1);
+			_HA_ATOMIC_ADD(&__objt_server(s->target)->counters.p.http.cum_req, 1);
+		}
 
 		/* Adjust server's health based on status code. Note: status codes 501
 		 * and 505 are triggered on demand by client request, so we must not
diff --git a/src/stats.c b/src/stats.c
index f76fd3790..754819734 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -1016,19 +1016,10 @@ static int stats_dump_fields_html(struct buffer *out,
 
 		/* http response (via hover): 1xx, 2xx, 3xx, 4xx, 5xx, other */
 		if (strcmp(field_str(stats, ST_F_MODE), "http") == 0) {
-			unsigned long long tot;
-
-			tot  = stats[ST_F_HRSP_OTHER].u.u64;
-			tot += stats[ST_F_HRSP_1XX].u.u64;
-			tot += stats[ST_F_HRSP_2XX].u.u64;
-			tot += stats[ST_F_HRSP_3XX].u.u64;
-			tot += stats[ST_F_HRSP_4XX].u.u64;
-			tot += stats[ST_F_HRSP_5XX].u.u64;
-
 			chunk_appendf(out,
 			              "<tr><th>New connections:</th><td>%s</td></tr>"
 			              "<tr><th>Reused connections:</th><td>%s</td><td>(%d%%)</td></tr>"
-			              "<tr><th>Cum. HTTP responses:</th><td>%s</td></tr>"
+			              "<tr><th>Cum. HTTP requests:</th><td>%s</td></tr>"
 			              "<tr><th>- HTTP 1xx responses:</th><td>%s</td><td>(%d%%)</td></tr>"
 			              "<tr><th>- HTTP 2xx responses:</th><td>%s</td><td>(%d%%)</td></tr>"
 			              "<tr><th>- HTTP 3xx responses:</th><td>%s</td><td>(%d%%)</td></tr>"
@@ -1042,13 +1033,19 @@ static int stats_dump_fields_html(struct buffer *out,
 			              U2H(stats[ST_F_REUSE].u.u64),
 			              (stats[ST_F_CONNECT].u.u64 + stats[ST_F_REUSE].u.u64) ?
 			              (int)(100 * stats[ST_F_REUSE].u.u64 / (stats[ST_F_CONNECT].u.u64 + stats[ST_F_REUSE].u.u64)) : 0,
-			              U2H(tot),
-			              U2H(stats[ST_F_HRSP_1XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_1XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_2XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_2XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_3XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_3XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_4XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_4XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_5XX].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_5XX].u.u64 / tot) : 0,
-			              U2H(stats[ST_F_HRSP_OTHER].u.u64), tot ? (int)(100 * stats[ST_F_HRSP_OTHER].u.u64 / tot) : 0,
+			              U2H(stats[ST_F_REQ_TOT].u.u64),
+			              U2H(stats[ST_F_HRSP_1XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_1XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_2XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_2XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_3XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_3XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_4XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_4XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_5XX].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_5XX].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
+			              U2H(stats[ST_F_HRSP_OTHER].u.u64), stats[ST_F_REQ_TOT].u.u64 ?
+			              (int)(100 * stats[ST_F_HRSP_OTHER].u.u64 / stats[ST_F_REQ_TOT].u.u64) : 0,
 			              U2H(stats[ST_F_WREW].u.u64),
 			              U2H(stats[ST_F_EINT].u.u64));
 		}
@@ -1796,6 +1793,7 @@ int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
 
 	/* http response: 1xx, 2xx, 3xx, 4xx, 5xx, other */
 	if (px->mode == PR_MODE_HTTP) {
+		stats[ST_F_REQ_TOT] = mkf_u64(FN_COUNTER, sv->counters.p.http.cum_req);
 		stats[ST_F_HRSP_1XX] = mkf_u64(FN_COUNTER, sv->counters.p.http.rsp[1]);
 		stats[ST_F_HRSP_2XX] = mkf_u64(FN_COUNTER, sv->counters.p.http.rsp[2]);
 		stats[ST_F_HRSP_3XX] = mkf_u64(FN_COUNTER, sv->counters.p.http.rsp[3]);
-- 
2.26.2
From fad692949f902fb4f3240ee9eba30b9a64a171ba Mon Sep 17 00:00:00 2001
From: Marcin Deranek <[email protected]>
Date: Fri, 15 May 2020 20:02:40 +0200
Subject: [PATCH 3/3] MEDIUM: stats: Enable more accurate moving average calculation for stats

Enable more accurate generation of moving averages for a partially
populated samples window.
---
 src/stats.c  | 29 +++++++++++++++++++++--------
 src/stream.c | 21 +++++++++++++--------
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/stats.c b/src/stats.c
index 754819734..1413d3dcb 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -1610,6 +1610,8 @@ int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
 	struct buffer *out = get_trash_chunk();
 	enum srv_stats_state state;
 	char *fld_status;
+	long long srv_samples_counter;
+	unsigned int srv_samples_window = TIME_STATS_SAMPLES;
 
 	if (len < ST_F_TOTAL_FIELDS)
 		return 0;
@@ -1809,10 +1811,14 @@ int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
 	stats[ST_F_SRV_ABRT] = mkf_u64(FN_COUNTER, sv->counters.srv_aborts);
 	stats[ST_F_LASTSESS] = mkf_s32(FN_AGE, srv_lastsession(sv));
 
-	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.q_time, TIME_STATS_SAMPLES));
-	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.c_time, TIME_STATS_SAMPLES));
-	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, TIME_STATS_SAMPLES));
-	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.t_time, TIME_STATS_SAMPLES));
+	srv_samples_counter = (px->mode == PR_MODE_HTTP) ? sv->counters.p.http.cum_req : sv->counters.cum_lbconn;
+	if (srv_samples_counter < TIME_STATS_SAMPLES && srv_samples_counter > 0)
+		srv_samples_window = srv_samples_counter;
+
+	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.q_time, srv_samples_window));
+	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.c_time, srv_samples_window));
+	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, srv_samples_window));
+	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(sv->counters.t_time, srv_samples_window));
 
 	stats[ST_F_QT_MAX] = mkf_u32(FN_MAX, sv->counters.qtime_max);
 	stats[ST_F_CT_MAX] = mkf_u32(FN_MAX, sv->counters.ctime_max);
@@ -1870,6 +1876,9 @@ static int stats_dump_sv_stats(struct stream_interface *si, struct proxy *px, st
  */
 int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int len)
 {
+	long long be_samples_counter;
+	unsigned int be_samples_window = TIME_STATS_SAMPLES;
+
 	if (len < ST_F_TOTAL_FIELDS)
 		return 0;
@@ -1942,10 +1951,14 @@ int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int le
 	stats[ST_F_COMP_RSP] = mkf_u64(FN_COUNTER, px->be_counters.p.http.comp_rsp);
 	stats[ST_F_LASTSESS] = mkf_s32(FN_AGE, be_lastsession(px));
 
-	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.q_time, TIME_STATS_SAMPLES));
-	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.c_time, TIME_STATS_SAMPLES));
-	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.d_time, TIME_STATS_SAMPLES));
-	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.t_time, TIME_STATS_SAMPLES));
+	be_samples_counter = (px->mode == PR_MODE_HTTP) ? px->be_counters.p.http.cum_req : px->be_counters.cum_lbconn;
+	if (be_samples_counter < TIME_STATS_SAMPLES && be_samples_counter > 0)
+		be_samples_window = be_samples_counter;
+
+	stats[ST_F_QTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.q_time, be_samples_window));
+	stats[ST_F_CTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.c_time, be_samples_window));
+	stats[ST_F_RTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.d_time, be_samples_window));
+	stats[ST_F_TTIME] = mkf_u32(FN_AVG, swrate_avg(px->be_counters.t_time, be_samples_window));
 
 	stats[ST_F_QT_MAX] = mkf_u32(FN_MAX, px->be_counters.qtime_max);
 	stats[ST_F_CT_MAX] = mkf_u32(FN_MAX, px->be_counters.ctime_max);
diff --git a/src/stream.c b/src/stream.c
index 4fa41a71d..1b8a86870 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -2441,6 +2441,7 @@ void stream_update_time_stats(struct stream *s)
 	int t_data;
 	int t_close;
 	struct server *srv;
+	unsigned int samples_window;
 
 	t_request = 0;
 	t_queue   = s->logs.t_queue;
@@ -2463,19 +2464,23 @@ void stream_update_time_stats(struct stream *s)
 	srv = objt_server(s->target);
 	if (srv) {
-		swrate_add(&srv->counters.q_time, TIME_STATS_SAMPLES, t_queue);
-		swrate_add(&srv->counters.c_time, TIME_STATS_SAMPLES, t_connect);
-		swrate_add(&srv->counters.d_time, TIME_STATS_SAMPLES, t_data);
-		swrate_add(&srv->counters.t_time, TIME_STATS_SAMPLES, t_close);
+		samples_window = (((s->be->mode == PR_MODE_HTTP) ?
+			srv->counters.p.http.cum_req : srv->counters.cum_lbconn) > TIME_STATS_SAMPLES) ? TIME_STATS_SAMPLES : 0;
+		swrate_add_dynamic(&srv->counters.q_time, samples_window, t_queue);
+		swrate_add_dynamic(&srv->counters.c_time, samples_window, t_connect);
+		swrate_add_dynamic(&srv->counters.d_time, samples_window, t_data);
+		swrate_add_dynamic(&srv->counters.t_time, samples_window, t_close);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.qtime_max, t_queue);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.ctime_max, t_connect);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.dtime_max, t_data);
 		HA_ATOMIC_UPDATE_MAX(&srv->counters.ttime_max, t_close);
 	}
-	swrate_add(&s->be->be_counters.q_time, TIME_STATS_SAMPLES, t_queue);
-	swrate_add(&s->be->be_counters.c_time, TIME_STATS_SAMPLES, t_connect);
-	swrate_add(&s->be->be_counters.d_time, TIME_STATS_SAMPLES, t_data);
-	swrate_add(&s->be->be_counters.t_time, TIME_STATS_SAMPLES, t_close);
+	samples_window = (((s->be->mode == PR_MODE_HTTP) ?
+		s->be->be_counters.p.http.cum_req : s->be->be_counters.cum_lbconn) > TIME_STATS_SAMPLES) ? TIME_STATS_SAMPLES : 0;
+	swrate_add_dynamic(&s->be->be_counters.q_time, samples_window, t_queue);
+	swrate_add_dynamic(&s->be->be_counters.c_time, samples_window, t_connect);
+	swrate_add_dynamic(&s->be->be_counters.d_time, samples_window, t_data);
+	swrate_add_dynamic(&s->be->be_counters.t_time, samples_window, t_close);
 	HA_ATOMIC_UPDATE_MAX(&s->be->be_counters.qtime_max, t_queue);
 	HA_ATOMIC_UPDATE_MAX(&s->be->be_counters.ctime_max, t_connect);
 	HA_ATOMIC_UPDATE_MAX(&s->be->be_counters.dtime_max, t_data);
-- 
2.26.2

