Re: Re: sched lock/unlock issue on rp2350

Kevin Witteveen Thu, 24 Apr 2025 04:34:04 -0700

Oh something to add;
The serial output is typically messed up (usually when "help" is ran, or
printf use). I tried to debug it, but the debugger "gives up / skips" on
random instructions. Or fails to write variables.
People seem to have accepted this behavior as normal, but I am very worried
about it as it gives more clues to the issue from the bigger picture.



Op do 24 apr 2025 om 13:30 schreef Kevin Witteveen <[email protected]>:

> Hi,
>
> I am just going to share this here too.
> I experienced ostest fails both on RP2040 and RP2350.
> Disabling SMP on RP2350 only moved the problem elsewhere.
> It stopped the ostest fails for a bit, but when adding more applications
> and threads, the issue comes back.
> Sometimes it hangs, rarely it halts. All depends on how you built it, what
> you put in it and in what order.
> Typical memory corruption issues.
>
> It is harder to replicate though. However, the RP2x is very very unstable
> from what I have experienced so far.
>
> I am not sure if we should focus all in on the SMP specifically.
>
> Op do 24 apr 2025 om 13:24 schreef Serg Podtynnyi
> <[email protected]>:
>
>> Yes, I understood that, what if CPU A and CPU B enters the unlock routine
>> with 1 lockcount on the same task and after the decrement they both go into
>> preemption and manipulating with linked list task list?
>>
>> On April 24, 2025 8:06:36 AM UTC, hujun260 <[email protected]> wrote:
>> >The lockcount can be understood as a local variable, so no race
>> conditions will occur.
>> >编辑
>> >分享
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >At 2025-04-24 15:54:48, "Serg Podtynnyi" <[email protected]>
>> wrote:
>> >>Hi, I tried my best to understand the logic behind unlock and
>> >>merge_pending routines and come up with this fix that works for me.
>> >>I split the lockcount handling in 2 parts when lockcount is > 1 just
>> >>decrement fast as always,
>> >>but when it's 1 we enter critical section and make it 0 inside
>> >>it(pre-emption is enabled).
>> >>I think there is a small time after decrement in the if statement and
>> >>  enter_critical_section_wo_note() in original code where can race
>> happen.
>> >>Does this make sense? Sorry if it's a dumb question - scheduler stuff
>> >>with SMP is rocket-science level for me yet.
>> >>
>> >>diff --git a/sched/sched/sched_unlock.c b/sched/sched/sched_unlock.c
>> >>
>> >>index 4c7d681454..15fc39b546 100644
>> >>--- a/sched/sched/sched_unlock.c
>> >>+++ b/sched/sched/sched_unlock.c
>> >>@@ -65,13 +65,14 @@ void sched_unlock(void)
>> >>
>> >>        DEBUGASSERT(rtcb == NULL || rtcb->lockcount > 0);
>> >>
>> >>-      /* Check if the lock counter has decremented to zero. If so,
>> >>-       * then pre-emption has been re-enabled.
>> >>-       */
>> >>-
>> >>-      if (rtcb != NULL && --rtcb->lockcount == 0)
>> >>+      if (rtcb != NULL && rtcb->lockcount > 1)
>> >>+        {
>> >>+          rtcb->lockcount--;
>> >>+        }
>> >>+      else if (rtcb != NULL && rtcb->lockcount == 1)
>> >>          {
>> >>            irqstate_t flags = enter_critical_section_wo_note();
>> >>+          rtcb->lockcount  = 0;
>> >>
>> >>            /* Note that we no longer have pre-emption disabled. */
>> >>
>> >>--
>> >>2.49.0
>> >>On 4/22/25 14:32, hujun260 wrote:
>> >>
>> >>It seems to have nothing to do with the race condition, and since the
>> rtcb is this_task, it will only be modified by one CPU.<br/><br/>As for why
>> your modification can solve the problem, I haven't figured it out yet. Can
>> you further analyze the reasons for the failure to boot?
>> >>At 2025-04-22 13:25:13, "Serg Podtynnyi" <[email protected]>
>> wrote:
>> >>
>> >>> Hi All, Gregory, Sebastien,
>> >>>
>> >>> I'm porting NuttX on rp2350(pico2 board inside ClockWork PicoCalc) and
>> >>> having trouble getting SMP (2-core) config running
>> >>>
>> >>> without patching the sched lock/unlock routines.
>> >>>
>> >>> I am looking at the commit
>> >>>
>> https://github.com/apache/nuttx/commit/914ae532e68bf4ca75c758d6f541a1d2fcdce8c3
>> >>> and reverting it helps to boot the system.
>> >>>
>> >>> This the the patch to make it work for me, I updated the current
>> >>> lock/unlock by moving the enter_critical_section above the access to
>> >>> rtcb->lockcount
>> >>>
>> >>> Do you think it could be the case of a race condition with
>> rtcb->lockcount ?
>> >>>
>> >>> PS
>> >>>
>> >>> It's almost 11 years since I last touched NuttX - it's a please to
>> work
>> >>> with it again.
>> >>>
>> >>>
>> >>>  From 86fa0a65fdd37804154faf3db52e2826281cdfbd Mon Sep 17 00:00:00
>> 2001
>> >>> From: Serg Podtynnyi <[email protected]>
>> >>> Date: Tue, 22 Apr 2025 11:37:48 +0700
>> >>> Subject: sched: update lock/unlock critical section entry
>> >>>
>> >>> Level up enter_critical_section in sched lock/unlock to cover
>> >>> rtcb->lockcount access
>> >>>
>> >>> Signed-off-by: Serg Podtynnyi <[email protected]>
>> >>> ---
>> >>>   sched/sched/sched_lock.c   | 4 ++--
>> >>>   sched/sched/sched_unlock.c | 4 ++--
>> >>>   2 files changed, 4 insertions(+), 4 deletions(-)
>> >>>
>> >>> diff --git a/sched/sched/sched_lock.c b/sched/sched/sched_lock.c
>> >>> index d8f75e7c90..9cc8da3df4 100644
>> >>> --- a/sched/sched/sched_lock.c
>> >>> +++ b/sched/sched/sched_lock.c
>> >>> @@ -82,18 +82,18 @@ void sched_lock(void)
>> >>>          * operations on this thread (on any CPU)
>> >>>          */
>> >>>
>> >>> +      irqstate_t flags = enter_critical_section_wo_note();
>> >>>         if (rtcb != NULL && rtcb->lockcount++ == 0)
>> >>>           {
>> >>>   #if (CONFIG_SCHED_CRITMONITOR_MAXTIME_PREEMPTION >= 0) || \
>> >>>       defined(CONFIG_SCHED_INSTRUMENTATION_PREEMPTION)
>> >>> -          irqstate_t flags = enter_critical_section_wo_note();
>> >>>
>> >>>             /* Note that we have pre-emption locked */
>> >>>
>> >>>             nxsched_critmon_preemption(rtcb, true, return_address(0));
>> >>>             sched_note_preemption(rtcb, true);
>> >>> -          leave_critical_section_wo_note(flags);
>> >>>   #endif
>> >>>           }
>> >>> +      leave_critical_section_wo_note(flags);
>> >>>       }
>> >>>   }
>> >>> diff --git a/sched/sched/sched_unlock.c b/sched/sched/sched_unlock.c
>> >>> index 4c7d681454..d21e007587 100644
>> >>> --- a/sched/sched/sched_unlock.c
>> >>> +++ b/sched/sched/sched_unlock.c
>> >>> @@ -69,9 +69,9 @@ void sched_unlock(void)
>> >>>          * then pre-emption has been re-enabled.
>> >>>          */
>> >>>
>> >>> +      irqstate_t flags = enter_critical_section_wo_note();
>> >>>         if (rtcb != NULL && --rtcb->lockcount == 0)
>> >>>           {
>> >>> -          irqstate_t flags = enter_critical_section_wo_note();
>> >>>
>> >>>             /* Note that we no longer have pre-emption disabled. */
>> >>>
>> >>> @@ -162,7 +162,7 @@ void sched_unlock(void)
>> >>>               }
>> >>>   #endif
>> >>>
>> >>> -          leave_critical_section_wo_note(flags);
>> >>>           }
>> >>> +      leave_critical_section_wo_note(flags);
>> >>>       }
>> >>>   }
>> >>> --
>> >>> 2.49.0
>> >>>
>> >>>
>> >>> --
>> >>> Serg Podtynnyi
>> >>>
>> >>-- Serg Podtynnyi
>>
>

Re: Re: sched lock/unlock issue on rp2350

Reply via email to