Hi,
I just came back to the issue mentioned below, reviewing some of the
code. I think that things go wrong on many levels (and this will be
long, sorry about that):
- The patch mentioned below adopts a common mtimer driver for all the
risc-v platforms. This is fine as a concept, but I think that the
common mtimer driver's interface is broken. Even though it is an
OS-internal interface, it operates on timespec_s values, forcing every
operation to convert back and forth between the MTIME counter value
and a timespec - introducing possible rounding errors and an
unnecessary performance penalty. This kind of driver should work
directly at the counter resolution (typically microseconds), simply
comparing counter values and setting the compare register to enable
interrupts.
- The changes behind this also convert arch_alarm to serve only the OS
systick (nxsched_process_timer is called from oneshot_callback); it
cannot be used for anything else any more. IMHO, in the future the
alarm interface should develop into 1) maintaining a queue of pending
alarms from multiple clients and possibly 2) supporting multiple lower
halves (timers).
- There are now two "bases" for the systick. The arch_alarm.c
*calculates* the current systick directly from the counter value
(ONESHOT_TICK_CURRENT), and does at least two conversions on the way
(counter -> timespec -> systick, which is bound to fail with certain
selected frequencies). Elsewhere, the system uses the global tick
variable "g_system_ticks". These don't always match.
- I suppose the "do-while" loop in arch_alarm.c:oneshot_callback
somehow tries to catch up if there are missed interrupts/ticks or
something... but I am not sure; I don't quite understand it.
All in all, this thing just doesn't work at all. Basically it skips
ticks, and it sometimes counts 2 ticks at once. For me that happens on
the first tick always, and randomly after that. I don't even want to
start debugging why it happens. Most likely it is because of the mess
created with riscv_mtimer and arch_alarm: either a result of rounding
between timespec <-> mtime counter <-> systick, a mismatch between the
calculated systick value and g_system_ticks, or a race condition in
this complicated callback-after-callback contraption.
I won't continue debugging this further, because I have no idea how
this should be fixed or what the vision is behind the changes that
have been made. I could also work on a fix for the common good, if
there is some vision of how this should work. For now I will just
revert the patch in my own branches, to get the systick timer and the
watchdogs working reliably.
Note that this is broken on all risc-v platforms. Arm platforms seem
to handle the systick in their own arch-specific code as before.
Br,
Jukka
On 19.5.2023 15.51, Jukka Laitinen wrote:
Yes, it worked before, but a long time ago. I tested this on both arm
(stm32f7) and risc-v (mpfs) platforms.
I tracked the problem down to this patch:
commit 19758788356f8623bac5f439419e231ff81cac14
Author: Huang Qi <huang...@xiaomi.com>
Date: Mon Apr 11 18:42:24 2022 +0800
arch/risc-v: Apply common mtime driver to mtime based chips
Signed-off-by: Huang Qi <huang...@xiaomi.com>
The problem seems to be specific to RISC-V platforms.
If I revert the changes to my platform (mpfs) in the file
arch/risc-v/src/mpfs/mpfs_timerisr.c, and handle the timer interrupt
there, everything seems to work again.
The "common mtimer driver" seems a bit complex (it uses the alarm
interface), and I don't have time to debug it right now; I need to
come back to the issue later. Maybe there is a race condition
somewhere.
Warning to others: this might be broken for other risc-v platforms as
well.
- Jukka
On 17.5.2023 18.51, Nathan Hartman wrote:
Was it working before? If so, are you able to use a git bisect to find
the commit where the bug was introduced? This might minimize the
amount of testing and debugging that needs to be done.
On Wed, May 17, 2023 at 11:12 AM Jukka Laitinen <jlait...@gmail.com>
wrote:
Petro Karashchenko wrote on Wednesday, 17 May 2023:
How do you measure the wait period? Are you toggling a pin or using a
free-running MCU HW timer?
I used the RISC-V MTIMER, so it is a free-running HW counter at 1 us
resolution
Best regards,
Petro
On Wed, May 17, 2023, 5:43 PM Jukka Laitinen
<jukka.laiti...@iki.fi> wrote:
On 17.5.2023 16.38, Gregory Nutt wrote:
On 5/17/2023 7:21 AM, Gregory Nutt wrote:
On 5/17/2023 4:21 AM, Jukka Laitinen wrote:
Hi,
I just observed the behaviour mentioned in the subject;
I tried just calling this in a loop:
"
sem_t sem = SEM_INITIALIZER(0);
int ret;

ret = nxsem_tickwait_uninterruptible(&sem, 1);
"
and never posting the sem from anywhere. The function returns
-ETIMEDOUT properly on every call.
But when measuring the time spent in the wait, I randomly see that
the sleep time is sometimes less than one systick.
If I set the systick to 10 ms, I see a typical (correct) sleep time
between 10000 and 20000 us, but sometimes (very randomly) between 0
and 10000 us. In these error cases the return value is still correct
(-110, -ETIMEDOUT).
When sleeping for 2 ticks, I randomly see sleep times between 10000
and 20000 us; for 3 ticks, between 20000 and 30000 us. So, randomly,
the sleep is exactly one systick too short.
I looked through the implementation of
"nxsem_tickwait_uninterruptible" itself and didn't see a problem
there. (Actually, I think there is a bug if -EINTR occurs; in that
case it should always sleep at least one more tick, and now it
doesn't. But that is not related to this; in my test there was no
-EINTR.)
I believe the problem might be somewhere in sched/wdog/, but so far I
couldn't track down what causes it.
Has anyone else seen the same issue?
Br,
Jukka
If I understand what you are seeing properly, then it is normal and
correct behavior for an arbitrary (asynchronous) timer. See
https://cwiki.apache.org/confluence/display/NUTTX/Short+Time+Delays
for an explanation.
NuttX timers have always worked that way, and this has confused
people who use the timers near the limits of their resolution. A
solution is to use a very high resolution timer in tickless mode.
Oops. You are seeing a timer that is 1 tick too short. That is an
error and should never happen. Sorry for reading incorrectly; it was
still early in the morning here.
The timer logic adds +1 tick to the requested delay to ensure that
this error never occurs. If the +1 were not added, the bad result
would be exactly as you describe (and as explained in the confluence
reference).
Hi, yes, exactly. I am seeing a timeout 1 tick too short. Sorry for
not explaining it clearly enough :)
I fear that there is now some bug. It was rather easy to reproduce:
just a loop with a few thousand iterations, and it occurs (infinite
loop, 10 ms tick, less than a minute to catch). Most of the time it
works OK; the sleep time is longer than the requested ticks. But when
it triggers, the sleep is exactly one tick too short (and shorter
than the requested timeout in ticks).
I was just asking if others have seen this as well; I'd like to know
if it is really a bug in current NuttX main. It is always possible
that there is something funny in our local build, although I can't
see what it could be.
-Jukka