Attention is currently required from: iedemam, laforge, fixeria, pespin.
neels has posted comments on this change. ( 
https://gerrit.osmocom.org/c/osmo-bsc/+/27081 )

Change subject: stats: new trackers for lchan life duration
......................................................................


Patch Set 15:

(9 comments)

Patchset:

PS15:
The patch is turning out nicely, but we still have a couple of problems.
Some of those are my fault of giving recommendations in code review without 
thinking it through.
I hope I am not causing frustration by "always having yet more comments".


File src/osmo-bsc/lchan_fsm.c:

https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/c0c177bc_13624a22
PS15, Line 208:         if (lchan->active_ms.cfg.rate_ctr) {
I understand now: the rate_ctr is set only for SDCCH and TCH type.


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/085bc7b3_caa11933
PS15, Line 209:                 osmo_time_cc_cleanup(&lchan->active_ms);
osmo_time_cc_cleanup() is not necessary, the forget_sum configuration does this 
implicitly


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/9a4b8bc7_c242c7e9
PS15, Line 465: uint
> Is it a valid type from <stdint.h>? Never saw it before. […]
I like to use "uint" in private projects, but there i typedef it manually as 
unsigned int. Indeed i've never seen it in osmocom code. it is certainly not 
from stdint.h, wondering why this compiles.


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/f5fed5b1_a399fc61
PS15, Line 478:         counter_id = -1;
> This should go to the 'default' branch of the 'switch' statement below.
could, yes, but totally fine here IMHO.

But in fact, I would rather see counters for all channel types added, instead 
of counting "only" TCH and SDCCH. Those are the most interesting ones of 
course, but why not do all of them while we're at it.


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/724984ee_b0144023
PS15, Line 479:         switch (lchan->type) {
the lchan->type changes during operation of osmo-bsc, it is not known at the 
time of lchan_fsm_alloc() yet. For example, a TCH/F timeslot can be used as 
type SDCCH when all dedicated SDCCH timeslots are exhausted, or a dynamic 
timeslot can be used as TCH/F or TCH/H depending.

I think I would do this:

This here, lchan_fsm_alloc(), happens "only once" when a BTS starts up.
Configure here the lchan->active_ms.cfg, i.e. set the gran_usec and 
forget_sum_usec, unconditionally. But here do not set a rate_ctr yet (keep it 
NULL). So each lchan now has an accumulator for the time that it is active -- 
as long as the rate_ctr is NULL, all it does is increment its total_sum, no 
harm done. So we can always set the flag to true or false, for any and all 
lchan types, regardless of the rate_ctr being present.

Just before setting the flag to 'true' in lchan_on_fully_established(), point 
the lchan->active_ms.cfg.rate_ctr to the counter matching the lchan->type, 
because only then do we know the actual lchan->type in use. (We already know 
the type a few FSM states earlier, but we only really care upon fully 
established, when the flag goes true.)
lchan->active_ms always directs new counter increments to the rate_ctr it 
currently points at.
When setting the rate_ctr, do take care to set it to NULL if the lchan type has 
no counter. (personally I would add counters for all types while at it, but 
it's not mandatory)

The active_ms time_cc does never need to be reset, no need to call 
osmo_time_cc_cleanup(). forget_sum takes care that leftover timings are 
discarded; especially when forget_sum_usec = 1 it's guaranteed that we 
instantly forget surplus timings from previous times of flag == true. As soon 
as newly accumulated sums reach the threshold, a counter increment is 
triggered, to whichever rate counter it currently points at.

(Like this, we also have in active_ms.total_sum a sum of how long each 
individual lchan has been active since the BTS started operating, across all 
lchan->types it has operated as. We could maybe print that in the VTY output, 
as a nice-to-know value, not sure if it bears much meaning though, besides 
which lchans are the favorites.)


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/b461f303_2281ae3f
PS15, Line 493:                                 .gran_usec = 1*1000,
Sorry for my earlier comment, I've come to realize what exactly I recommended 
by mentioning 1 ms granularity. By setting gran_usec to one millisecond, we 
will actually schedule an osmo_timer to increment a rate counter every single 
millisecond that the flag is true. Modern machines should be able to handle 
that, but I fear it might make a difference in load. 1000 times per second for 
every active lchan feels like a lot for me as a human.

With gran_usec == 1000 and flag == true, every millisecond we re-schedule an 
osmo_timer to come up in the main select() loop. It will increment the rate 
counter, and schedule another timer to "interrupt" the select() loop a 
millisecond later. Mind you it is fine if the select() loop skips a couple of 
milliseconds here or there from handling messages, the rate counter will still 
be incremented correctly. But we would load the main select() loop a lot less 
with a larger granularity.

What is a meaningful precision you would like to see in the metrics? If one 
second is too large, I guess a tenth of a second (gran_usec = 100000) is 
sufficient for a humanly noticeable time scale?

(Would be interesting to benchmark that, see if this granularity does indeed 
make any difference at all on the CPU load. So far it's just a gut feeling of 
mine.)


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/63b93209_f57327e9
PS15, Line 498:                                 .T_forget_sum = -18,
timers X16, X17, X18 are defined to configure a different counter (the 
all_allocated counters). important: these X timers will override above 
gran_usec, forget_sum_usec settings!

I think it is sufficient to not define any T_* and always use the hardcoded 
configuration here.

If you'd like to make the active_ms configurable from VTY, we should define 
different X numbers.

Related:
https://osmocom.org/projects/cellular-infrastructure/wiki/List_of_Timer_numbers


https://gerrit.osmocom.org/c/osmo-bsc/+/27081/comment/d7f488bb_41c99b30
PS15, Line 562:         /* Stop the timekeeper, which triggers a report */
the comment suggests that the report is triggered only now. In fact a rate_ctr 
increment is triggered every time gran_usec elapses the rounding threshold 
while the flag is set 'true', and not really when the flag changes to 'false'.



--
To view, visit https://gerrit.osmocom.org/c/osmo-bsc/+/27081
To unsubscribe, or for help writing mail filters, visit 
https://gerrit.osmocom.org/settings

Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: I1b0670c47cb5e0b7776eda89d1e71545ba0e3347
Gerrit-Change-Number: 27081
Gerrit-PatchSet: 15
Gerrit-Owner: iedemam <[email protected]>
Gerrit-Reviewer: Jenkins Builder
Gerrit-Reviewer: laforge <[email protected]>
Gerrit-Reviewer: neels <[email protected]>
Gerrit-CC: fixeria <[email protected]>
Gerrit-CC: pespin <[email protected]>
Gerrit-Attention: iedemam <[email protected]>
Gerrit-Attention: laforge <[email protected]>
Gerrit-Attention: fixeria <[email protected]>
Gerrit-Attention: pespin <[email protected]>
Gerrit-Comment-Date: Tue, 08 Mar 2022 22:35:27 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: fixeria <[email protected]>
Gerrit-MessageType: comment

Reply via email to