Hi,
I just came back to the issue mentioned below, reviewing some of the
code. I think that things go wrong on many levels (and this will be
long, sorry about that):
- The patch mentioned below adopts a common mtimer driver for all the
risc-v platforms. This is fine as a concept, but I think that the
common mtimer driver's interface is broken. Even though it is an
OS-internal interface, it operates on timespec_s values, forcing every
operation to convert back and forth between the MTIME counter value
and a timespec - introducing possible rounding errors and an
unnecessary performance penalty. This kind of driver should work
directly at the counter resolution (typically microseconds), simply
comparing counter values and setting the compare register to enable
interrupts.
- The changes behind this also convert arch_alarm to serve only the OS
systick (nxsched_process_timer is called from oneshot_callback); it
cannot be used for anything else any more. IMHO, in the future the
alarm interface should develop into 1) maintaining a queue of pending
alarms from multiple clients and possibly 2) supporting multiple lower
halves (timers).
- There are now two "bases" for the systick. The arch_alarm.c
*calculates* the current systick directly from the counter value
(ONESHOT_TICK_CURRENT), and does at least two conversions on the way
(counter -> timespec -> systick, which is bound to fail with certain
selected frequencies). Elsewhere, the system uses the global tick
variable "g_system_ticks". These don't always match.
- I suppose the "do-while" loop in arch_alarm.c:oneshot_callback
somehow tries to catch up if there are missed interrupts/ticks or
something... but I am not sure; I don't quite understand it.
All in all, this thing just doesn't work at all. Basically it skips
ticks, and it sometimes counts 2 ticks at once. For me that happens on
the first tick always, and randomly after that. I don't even want to
start debugging why it happens. Most likely it is because of the mess
created with riscv_mtimer and arch_alarm: either a result of rounding
between timespec <-> mtime counter <-> systick, a mismatch between the
calculated systick value and g_system_ticks, or a race condition in
this complicated callback-after-callback contraption.
I won't continue debugging this further, because I have no idea how
this should be fixed or what the vision is behind the changes that
have been made. I could also work on a fix for the common good, if
there is some vision of how this should work. For now I will just
revert the patch in my own branches, to get the systick timer and the
watchdogs working reliably.
Note that this is broken on all risc-v platforms. Arm platforms seem
to handle the systick in their own arch-specific code as before.
Br,
Jukka
On 19.5.2023 15.51, Jukka Laitinen wrote:
Yes, it worked before, but a long time ago. I tested this on both arm
(stm32f7) and risc-v (mpfs) platforms.
I tracked the problem down to this patch:
commit 19758788356f8623bac5f439419e231ff81cac14
Author: Huang Qi <huang...@xiaomi.com>
Date: Mon Apr 11 18:42:24 2022 +0800
arch/risc-v: Apply common mtime driver to mtime based chips
Signed-off-by: Huang Qi <huang...@xiaomi.com>
The problem seems to be specific to RISC-V platforms.
If I revert the changes to my platform (mpfs) in the file
arch/risc-v/src/mpfs/mpfs_timerisr.c, and handle the timer interrupt
there, everything seems to work again.
The "common mtimer driver" seems a bit complex (it uses the alarm
interface), and I don't have time to debug it right now; I need to
come back to the issue later. Maybe there is a race condition
somewhere.
Warning to others: this might be broken for other risc-v platforms as
well.
- Jukka
On 17.5.2023 18.51, Nathan Hartman wrote:
Was it working before? If so, are you able to use a git bisect to find
the commit where the bug was introduced? This might minimize the
amount of testing and debugging that needs to be done.
On Wed, May 17, 2023 at 11:12 AM Jukka Laitinen <jlait...@gmail.com>
wrote:
Petro Karashchenko wrote on Wednesday, 17 May 2023:
How do you measure the wait period? Are you toggling a pin or using a
free-running MCU HW timer?
I used the RISC-V MTIMER, so it is a free-running HW counter at 1 us
resolution
Best regards,
Petro
On Wed, May 17, 2023, 5:43 PM Jukka Laitinen
<jukka.laiti...@iki.fi> wrote:
On 17.5.2023 16.38, Gregory Nutt wrote:
On 5/17/2023 7:21 AM, Gregory Nutt wrote:
On 5/17/2023 4:21 AM, Jukka Laitinen wrote:
Hi,
I just observed the behaviour mentioned in the subject;
I tried just calling this in a loop:
"
sem_t sem = SEM_INITIALIZER(0);
int ret;

ret = nxsem_tickwait_uninterruptible(&sem, 1);
"
and never posting the sem from anywhere. The function returns
-ETIMEDOUT properly on every call.
But when measuring the time spent in the wait, I randomly see that
the sleep time is sometimes less than one systick.
If I set the systick to 10 ms, I see a typical (correct) sleep time
between 10000 and 20000 us, but sometimes (very randomly) between 0
and 10000 us. In these error cases the return value is still correct
(-110, -ETIMEDOUT).
When sleeping for 2 ticks, I randomly see sleep times between 10000
and 20000 us; for 3 ticks, between 20000 and 30000 us. So, randomly,
the sleep is exactly one systick too short.
I looked through the implementation of
"nxsem_tickwait_uninterruptible" itself and didn't see a problem
there. (Actually, I think there is a bug if -EINTR occurs; in that
case it should always sleep at least one more tick, and now it
doesn't. But that is not related to this; in my test there was no
-EINTR.)
I believe the problem might be somewhere in sched/wdog/, but so far I
couldn't track down what causes it.
Has anyone else seen the same issue?
Br,
Jukka
If I understand what you are seeing properly, then it is normal and
correct behavior for an arbitrary (asynchronous) timer. See
https://cwiki.apache.org/confluence/display/NUTTX/Short+Time+Delays
for an explanation.
NuttX timers have always worked that way, and this has confused
people who use the timers near the limits of their resolution. A
solution is to use a very high resolution timer in tickless mode.
Oops. You are seeing a timer that is 1 tick too short. That is an
error and should never happen. Sorry for reading incorrectly; it was
still early in the morning here.
The timer logic adds +1 tick to the requested delay to ensure that
this error never occurs. If the +1 were not added, the bad result
would be exactly as you describe (and as explained in the confluence
reference).
Hi, yes, exactly. I am seeing a timeout 1 tick too short. Sorry for
not explaining it clearly enough :)
I fear that there is now some bug. It was rather easy to reproduce:
just a loop with a few thousand iterations, and it occurs (infinite
loop, 10 ms tick, less than a minute to catch). Most of the time it
works OK; the sleep time is longer than the requested ticks. But when
it triggers, the sleep is exactly one tick too short (and shorter
than the requested timeout in ticks).
I was just asking if others have seen this as well; I'd like to know
if it is really a bug in current NuttX main. It is always possible
that there is something funny in our local build, although I can't
see what it could be.
-Jukka