linguini1 commented on PR #17224: URL: https://github.com/apache/nuttx/pull/17224#issuecomment-3429959657
> ## Summary > > Previously the function `up_cpu_pause` was used during flash write operations for preventing all other CPUs from executing code from flash. The above function was removed in [d8cb775](https://github.com/apache/nuttx/commit/d8cb7759b657cf92dfb66aa8cf8c4ff6391a4d40). Thus, writing to flash on `rp2040` was only possible in non-SMP mode. > > This PR allows write operations to the flash on rp2040 even in SMP mode. It blocks all but the current CPU for the duration of the critical function (write or erase). > > The implementation is based on the previous implementation (see [d8cb775](https://github.com/apache/nuttx/commit/d8cb7759b657cf92dfb66aa8cf8c4ff6391a4d40)). It is slightly enhanced by supporting more than two CPUs and it avoids global variables. Thank you for the detailed test information! It's always appreciated a lot. Just to wrap my head around this, did d8cb7759 break the RP2040 code for SMP flash writing entirely? As in, it worked before this commit and then began to fail after that commit (I assume you found this with `git bisect`)? If so, that is a huge issue with the way that commit was tested before merge, we might want to see if there are other cases where things broke. > ## Details to be discussed? > > 1. The proposed implementation of the "pause all other CPUs" feature is generic enough to be helpful for other architectures, as well. E.g. the `esp32_spiram.c` and `esp32s3_spiram.c` use a similar implementation for guarding their flash accesses (see `g_cpu_pause`). `rp2350` will surely needs this, too, when flash write support is added. > > * Should I move this set of function (`*_smp_isolation`) somewhere else (e.g. below `sched/sched/`)? Which name would be appropriate? (`*_smp_isolation` is just my silly choice) I think that's not a bad idea if it's generally applicable. I think it would be fine to merge this fix first since it solves (presumably) a regression, and then afterward you could see about re-factoring the solution to use your more general implementation? > * Should I propose a separate pull request with the respective changes for esp32 in order to allow others to test them? (I do not have such a board) Yes, there are some people working on Espressif boards and can probably run your changes against their internal CI to catch anything. I would like to note that it's very appreciated that you mention this, since there are a lot of PRs which only test one board and assume the changes apply just as well elsewhere. > * Should I revive `sched_note_cpu_pause`? Currently it is probably unused. What I'm wondering is, do the `up_cpu_pause/resume` and etc. functions removed in the linked commit not achieve the same task that you'd achieve by creating your general SMP isolation functions? Because then you could just revive the old implementations. I haven't looked thoroughly, so if your new implementation would be more streamlined (you mentioned no globals) then just go with that. I'm not familiar with the semantics of `sched_note_cpu_pause` vs `up_cpu_pause`, etc. > 2. The implementation uses "spin_lock_t" (instead of `bool` flags). I picked this for readability. Is this a burden? Should I use `bool` instead? > > 3. The implementation moves the "enter_critical_section" call closer to the flash write operation (inside the "SMP isolation" calls). Previously `enter_critical_section` was called _before_ `up_cpu_pause`. Now it is called afterwards. > > * I am not sure, _why_ this is necessary, but the original call location leads to random hangs after writing around 50 kB. > > 4. The functions `enter_smp_isolation` and `leave_smp_isolation` call `sched_lock`/`sched_unlock`. This may feel a bit excessive, but I think, it is necessary: > Can't really help with any of this, sorry. I'm not familiar enough with much of the scheduling logic to say anything of value. > * a) `nxsched_smp_call_async` tries to run the "pause handler" directly on the current CPU, if `this_cpu()` is part of the given cpuset (see [source](https://github.com/apache/nuttx/blob/master/sched/sched/sched_smp.c#L317)). This would starve our current task and prevent the execution of "leave_smp_isolation". This behavior of "nxsched_smp_call_async" is probably a bug. Maybe you could raise an issue for it and post to the mailing list to get more eyes on this? > * b) We should not allow other SMP-aware functions to run, because they may rely on SMP being enabled (causing deadlocks?). This is possible, since `CONFIG_SMP` is used everywhere as the indicator for "multiple processes running in parallel". This is not true during the "smp isolation" context created by these functions. > > 5. At the moment, the blocked CPUs may receive and handle interrupts. At least I do not see anything preventing this. Maybe we should run "enter_critical_section" for each busy-locked CPU in order to avoid this? Ditto, not informed enough to help with this either. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
