neels has posted comments on this change. (
https://gerrit.osmocom.org/c/osmo-bsc/+/25973 )
Change subject: add time_cc API: cumlative counter for time, reported as
rate_ctr
......................................................................
Patch Set 1:
> > Well maybe then the question is why are you using rate_ctr and not
> > stat_items here, it really confuses me.
>
> At least at first sight, I agree. The resulting metric computed by this new
> code base renders a single value which matches better a state_item than a
> rate_ctr. Any particular argument to go for rate_ctr, Neels?
The decision to use a rate_ctr is based on discussion with the customer,
and it also makes a lot of sense in practice.
Logically, a stat_item is not actually a good choice. We can of course report
the total time of all-allocated, and thus get for example the complete amount
of seconds that all SDCCH channels were allocated since osmo-bsc started. But
it's not interesting to get an arbitrary amount of time of all-allocated since
forever; instead, it is important to qualify in which period of elapsed time
this amount was accumulated. A rate_ctr is well suited since it also provides
the "per time" aspect. All rate_ctr stats reflect a number-of-events-per-time.
For all_allocated, it is the number of seconds that all channels were allocated
per a given amount of time. For example, if the VTY shows all_allocated:sdcch
of 10/min, it means all channels were allocated for 10 seconds of the last
minute. For a stat item, getting this "per time" part is a complex problem.
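To illustrate the "per time" reading, here is a minimal Python sketch of a
counter that is incremented once per second of all-allocated and exposes a
per-minute delta. It is a simplified model for illustration only, not
libosmocore's rate_ctr implementation; the class and function names are
hypothetical.

```python
class PerMinuteCounter:
    """Simplified model of a counter with a per-minute rate (hypothetical)."""

    def __init__(self):
        self.total = 0          # counts since start, never reset
        self._last_total = 0    # snapshot of total at the last minute boundary
        self.per_min = 0        # increments seen during the last full minute

    def inc(self, n=1):
        self.total += n

    def minute_elapsed(self):
        # Called once per minute: the rate is the delta since the last snapshot.
        self.per_min = self.total - self._last_total
        self._last_total = self.total


def simulate(all_allocated_per_second):
    """Feed one boolean per second; return total and the rate after each minute."""
    ctr = PerMinuteCounter()
    rates = []
    for second, all_allocated in enumerate(all_allocated_per_second, start=1):
        if all_allocated:
            ctr.inc()
        if second % 60 == 0:
            ctr.minute_elapsed()
            rates.append(ctr.per_min)
    return ctr.total, rates


# Minute 1: all channels allocated for 10 seconds; minute 2: never.
samples = [True] * 10 + [False] * 50 + [False] * 60
total, rates = simulate(samples)
print(total, rates)
```

The total alone does not say when the congestion happened; the per-minute
values do, which is the aspect a plain stat_item would be missing.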
When reporting as a stat_item, we open a new dimension of options:
The spec defines different reporting periods, suggesting at least the options
of 5 minutes, 15 minutes, 30 minutes, 60 minutes. We could periodically clear
the stat item based on user config.
The customer requesting this feature already implements these reporting periods
outside of osmo-bsc, based on stats received from osmo-bsc. So instead of
introducing these reporting periods to osmo-bsc and choosing some method of
adding a per-time aspect to stat_item, it is best to just trigger a count for
each second of all-allocated-channels.
> simply a counter value changing over time.
When I started on it, I thought it would take half an hour.
When thinking about the exact implementation, the options and complexity
unfolded...
This patch is the result that ensures correct counts with minimal complexity.
> So I'm not really following on why you need all this infrastructure sorry,
I would appreciate it if your criticism could be qualified as well as constructive.
What do you mean by "all this"? What do you suggest instead?
> this all looks super complicated for no reason (I'm able to see). Maybe
> someone else can also shed some light on it.
It's straightforward:
The aim is to report for how many seconds per given time period all channels of
a type were allocated.
To achieve that, we need to count free/allocated lchans.
When a count reveals that all chans of type X are allocated, we set a flag to
true.
Based on that flag, a time counter increments. The flag-per-time counter is a
generalized API (time_cc).
In order to periodically report that time counter to stats, an osmo_timer is
involved.
I am open to simplifications, if possible.
There are some additional options to configure time_cc with different
granularity, and to allow tweaking the counter precision vs response time.
These options aren't strictly necessary. I think they make sense to keep
time_cc generally useful.
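The steps above (count lchans, set a flag, accumulate time while the flag is
set, report from a periodic timer) could be sketched roughly as follows. This
is a Python illustration under my own assumptions, not the time_cc code from
the patch; TimeCCSketch, gran_usec, set_flag(), report() and all_allocated()
are hypothetical names, and timestamps are passed in explicitly instead of
coming from an osmo_timer.

```python
class TimeCCSketch:
    """Accumulate the time a flag was true; report full granularity units.

    Hypothetical illustration only; not the time_cc API from the patch.
    """

    def __init__(self, gran_usec=1_000_000):
        self.gran_usec = gran_usec   # one counter increment per this much flagged time
        self.flag = False
        self.last_usec = 0           # time up to which accumulation is done
        self.sum_usec = 0            # total flagged time so far
        self.reported = 0            # increments already handed to the counter

    def _accumulate(self, now_usec):
        if self.flag:
            self.sum_usec += now_usec - self.last_usec
        self.last_usec = now_usec

    def set_flag(self, now_usec, flag):
        # Account for the time spent in the old state before switching.
        self._accumulate(now_usec)
        self.flag = flag

    def report(self, now_usec):
        """Called from a periodic timer; returns new increments for the counter."""
        self._accumulate(now_usec)
        due = self.sum_usec // self.gran_usec
        new = due - self.reported
        self.reported = due
        return new


def all_allocated(lchans_in_use, lchans_total):
    # The flag is true only if there are channels and none of them is free.
    return lchans_total > 0 and lchans_in_use >= lchans_total


cc = TimeCCSketch()
cc.set_flag(0, all_allocated(4, 4))           # all 4 channels busy from t=0
first = cc.report(2_500_000)                  # 2.5 s flagged so far
cc.set_flag(3_000_000, all_allocated(3, 4))   # one channel freed at t=3.0 s
second = cc.report(10_000_000)                # 3.0 s flagged in total
print(first, second)
```

Partial seconds are carried over between reports instead of being dropped,
which is one way to read "correct counts"; a smaller granularity would trade
coarser units for quicker response.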
> So the question remains: Should the result be exposed as rate_ctr or as
> stat_item?
We could do both, in fact. All the complex parts are already implemented and
working correctly.
Next to the rate_ctr, we can just add a stat_item to time_cc, and publish the
time count as stat item. But then we need to define the time periods and exact
meaning of the stat_item values.
I encourage you to practically imagine the solution and you should see how the
problem is not as trivial as it sounds at first. It is easy to add the
stat_item, as soon as it is clear which value the stat_item should reflect. We
already have a value implemented that counts all seconds where all channels
were allocated since osmo-bsc started. But does it make sense to publish that
as stat_item?
Here are the various ideas I had before we decided on a rate_ctr as the
simplest and most effective solution:
"
I am thinking about the allAvailable{TCH,SDCCH}AllocatedTime indicators:
In 3GPP TS 52.402, there is a defined Granularity Period, which is configurable,
and suggested to have at least the settings of 5, 15, 30, 60 minutes.
The allAvailableXxxAllocatedTime indicators are defined as cumulative counter
(CC), which I interpret as the number of seconds that all channels of the given
kind were occupied.
A "problem" is that the meaning of this cumulative value depends on the
Granularity Period.
For example, if the granularity period is 30 minutes, a cumulative value of 5
minutes for "all channels allocated" means that the cell was congested roughly
17% of the time.
If the granularity period is only 5 minutes, then the exact same value means
100% congestion.
So it appears to me that it is less confusing / more meaningful to report the
value in % of time?
Looking at details of how to implement this, it appears that we need to first
introduce this concept of a Granularity Period to our statistics API. We have a
stats reporter interval, which is usually a lot shorter than 5 minutes. Also
this interval so far only affects the times at which an independently defined
value will become reported. IIUC we so far don't have any values that are
dependent on the reporting interval itself, where some cumulative counter value
gets reset to zero whenever a reporting period has elapsed.
Here are my ideas to implement such cumulative counters:
variant 1:
Internally, we clearly define a Granularity Period, as described in the spec.
Let's say it is set to 5 minutes.
This Granularity Period is implemented completely independently from the stats
reporting period.
At first, the cumulative counter is zero. For the next 5 minutes, we add up the
times (in seconds) where all channels were occupied. When the five minutes have
elapsed, we "push" the cumulative value to a stat item and reset the counter.
So only one value will be published in a stat item every 5 minutes, and the
value does not change while we are busy accumulating the counter value for the
next 5 minutes.
This seems most spec conforming. But this also seems kind of low resolution /
slowly responsive.
The 5 minute period would be independent from the stat reporting period, i.e.
there would be N stat reporting periods where the stat does not change at all,
e.g. for 5 minutes, and only then would we get a sum of the last 5 minutes,
again staying fixed on the dashboard for the next 5 minutes.
variant 2:
We have two rate counters, one incrementing for each second where all channels
were occupied (A), one incrementing for each second where at least one channel
was still available (B). These get reported continuously and also degrade as
rate counters do. Comparing one to the other, e.g. A / (A + B), gives a
continuous indication of congestion rate.
So the value will gradually rise and fall as the seconds pass, and we don't
have to wait five minutes to see that congestion has occurred.
variant 2b:
It should actually suffice to have only one rate counter incrementing for each
second where all channels were occupied.
Since rate counters implicitly count events per second, per minute, per hour,
we can see that e.g. a rate of 60 per minute means that we have been
continuously congested for the last minute.
variant 3:
We introduce a new kind of cumulative stat item which gets reset to zero
whenever a stat reporting period has elapsed.
We have two such stat items, one counting the seconds congested (A), one
counting seconds not congested (B), and a meaningful statistic comes from
comparing A to A+B. (the reporting period may then fluctuate without ill
effects)
variant 3b:
Such new cumulative stat item as in 3 may always implicitly report percent
compared to the elapsed reporting period.
variant 3c:
just use a normal stat item, and introduce some callback function that can be
set up to clear the stat item to zero every time the stat report has been sent
out.
For variant 2 (rate counters), we don't need to introduce configuration of a
granularity period, nor invent a new kind of stat item. But this is also the
farthest away from how the performance indicator is defined in the spec.
We could also implement multiple variants. To me it would make sense to
implement both variant 1 and 2b, to have a most spec conforming stat item that
reports less frequently, as well as a "running congestion counter" as a rate
counter that continuously shows a curve of congestion seen per time.
"
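To make the Granularity Period dependency from the quoted notes concrete, here
is a small Python sketch of variant 1 (accumulate per period, push, reset) plus
the percent-of-period reading. It is only an illustration; variant1_values()
and percent_of_period() are hypothetical helpers, and the input is one boolean
sample per second.

```python
def variant1_values(congested_per_second, granularity_period_s):
    """Variant 1: sum congested seconds per granularity period, then reset."""
    values = []
    acc = 0
    for second, congested in enumerate(congested_per_second, start=1):
        acc += 1 if congested else 0
        if second % granularity_period_s == 0:
            values.append(acc)   # "push" the value to the stat item
            acc = 0              # start accumulating the next period
    return values


def percent_of_period(congested_seconds, granularity_period_s):
    """Express a cumulative value relative to its granularity period."""
    return 100.0 * congested_seconds / granularity_period_s


# 30 minutes of samples: congested during the first 5 minutes only.
samples = [True] * 300 + [False] * 1500

per_5min = variant1_values(samples, 5 * 60)
per_30min = variant1_values(samples, 30 * 60)

print(per_5min, per_30min)
print(percent_of_period(300, 30 * 60), percent_of_period(300, 5 * 60))
```

The same 300 seconds show up as one value per 30 minutes or as a first value
followed by zeros per 5 minutes, and only the percentage makes the two
directly comparable.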
--
To view, visit https://gerrit.osmocom.org/c/osmo-bsc/+/25973
Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: Icdd36f27cb54b2e1b940c9e6404ba9dd3692a310
Gerrit-Change-Number: 25973
Gerrit-PatchSet: 1
Gerrit-Owner: neels <[email protected]>
Gerrit-Reviewer: Jenkins Builder
Gerrit-CC: laforge <[email protected]>
Gerrit-CC: pespin <[email protected]>
Gerrit-Comment-Date: Mon, 01 Nov 2021 12:32:21 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment