Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

Willy Tarreau Mon, 30 Jan 2017 11:25:29 -0800

Hi Krishna,

back on earth ;-)

On Tue, Jan 03, 2017 at 03:07:26PM +0530, Krishna Kumar (Engineering) wrote:
> I explored your suggestion of "hard-coded periods", and have some
> problems: code complexity seems to be very high at updates (as well
> as retrievals possibly); and I may not be able to get accurate results.
> E.g. I have data for 1, 4, 16 seconds; and at 18 seconds, a request is
> made for retrieval of the last 16 seconds (or 1,4,16). At this time I have
> values for last 18 seconds not 16 seconds. I explored using timers to
> cascade (will not work as it may run into races with the setters, and
> also adds too much overhead) vs doing this synchronously when the
> event happens. Both are complicated and have the above issue of not
> able to get accurate information depending on when the request is
> made.

The principle is quite simple but requires a small explanation of how our
frequency counters manage to report such accurate values first.

We're counting events that happen over a period, represented by a cross
over the time line below :

    X     X  X        X  X    X  X     X     X     X   X
  |--------------------------------------------------------> t

In order to be able to measure an approximately accurate frequency without
storing too much data, we'll report an average over a period. The problem
is, if we only report a past average, we'll get slowly changing data and
in particular it would not be usable to detect quick changes such as a high
rate happening over the last second. Thus we always keep two measures, the
one of the previous period, and the one of the current period.

    X     X  X        X  X    X  X     X     X     X   X
  |-----------------------------|--------------------------> t
      <--- previous --->        |    <--- current --->
           count=6                        count=5

And using this we implement a pseudo sliding window. With a true sliding
window, you would count events between two points that are always exactly
the same distance apart, which requires to memorize all of them. With
this pseudo sliding window, instead, we make use of the fact that events
follow an approximately homogenous distribution over each period and have
approximately the same frequency over both periods. Thus the mechanism
becomes more of a "fading" window in fact : we consider one part of the
previous window that we sum with the current one. The closer the current
date is from the end of the current window, the least we count on the
previous window. See below, I've represented 5 different sliding windows
with the ratio of the previous window that we consider as still valid
marked with stars and the ratio of the old period at the end of the line :

    X     X  X        X  X    X  X     X     X     X   X
  |-----------------------------|----------------------------> t
        |***********************------|  80%
              |*****************------------|  60%
                    |***********------------------|  40%
                          |*****------------------------|  20%
                                |-----------------------------|  0%

Thus the measure exactly is current + (1-relative_position) * previous.
Once the period is over, the "current" value overwrites the "previous"
one, and is cleared again, waiting for new events. This is done during
insertion and possibly during value retrieval if it's noticed that the
period has switched since last time.

In fact this mechanism can be repeated over multiple periods if needed,
the principle must always be the same :

    X     X  X        X  X    X  X     X     X     X   X
  |------------|------------|------------|------------|----------> t
      old          N-3         N-2           N-1        current

The "old" period is the one on which we apply the fading effect. The
current period is the one being entirely accounted (since it's being
filled as we're looking at it) so here you could always keep a few
past values (rotating copies of "current") and keep the last one for
the fade out. In the simplified case above we just don't have the
N-1..N-M, we only have "current" and "old".

Thus as you can see here, the "current" period is the only dynamic one,
which is the exact sum of events over the last period. But now what if
we wanted to implement this with multiple levels ? It's very easy, the
"current" period being the sum of events, it simply is the sum of the
lower layers when you cascade multiple counters, thus it's the lower
layer's sum of [current] + [old*ratio] + [n-1] + [n-2] +...+ [n-m].

Thus let's have an example with periods of 1s, 2s, 4s and 8s :

              previous                         current
 [8s] |----------------------------|------------------------------> t
 [4s]                              |---------------|-------------->
 [2s]                                              |-------|------>
 [1s]                                                      |---|-->

I guess you're seeing a pattern : each layer's "current" value is useless
as it's derived from the lower layers values. So you don't even have
to implement the current part (except for the smallest precision layer),
you just need each layer's previous value. If we call "Pn" the previous
value for layer (n)s, "Rn" the ratio applied to this due to the current
time position within the current period of (n) seconds, "Cn" the current
value for layer "n", then you quickly see that :

       N - 1
        ___
        \
   Cn =  > Pi + C0
        <__
       i = 1

   Avg_rate(n) = Pn * Rn + Cn

What this implies is that by storing only 8 counters, you can have an
history of 1, 2, 4, 8, 16, 32 and 64s which still moves as fast as the
quotient represented by the movements of the measurement period.

And better, if you want different periods, you can apply the mechanism
above to store more periods. For example we could imagine that you store :

   60s :  old, N-3, N-2, N-1
   15s :  old, N-2, N-1
    5s :  old, N-4, N-3, N-2, N-1
    1s :  old, cur

So the 60s avg can be computed as [N60-3]+[N60-2]+[N60-1]+old*ratio +
lower layers.

Updates are not complicated at all, however they can propagate along a
long chain, but normally this chain should not be too long, and the longest
the chain, the rarest they apply (the frequency of a chain of N nodes is
1/2^N), so for example in the example with 64s above you would only update
7 levels once per minute, which is ridiculously cheap.

The variable sizes above (eg: 1s, 5s, 15s, 60s) are appealing for the end
user but more complicated to implement because you need to know the number
of buckets at each layer, so that makes the computation functions less
generic (or rely on more information that you have to store). Also, while
dealing with only two buckets per level is trivial (a simple copy), for
the other numbers, you need to implement some kind of a rotation, with the
current pointer determined by a modulus on the current time, which also is
less appealing. But if you go with powers of 4 like (1, 4, 16, 64), it's
simplifying quite a bit since you only have to store 4 buckets per layer
and the modulus is a simple binary AND on the time.

> To implement your suggestion of say histograms, the retrieval code can
> calculate the 4 values (1, 4, 16, and 64 seconds) by averaging across
> the correct intervals. In this case, the new CLI command is not required,
> and by default it prints all 4 values. Would this work in your opinion?

Yes that's exactly the point. If you have only a few values to display,
you can always display them.

I hope I helped you see your way through the solution better and did not
confuse you too much :-)

Best regards,
Willy

Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

Reply via email to