On 07/07/25(Mon) 14:11, Jeremie Courreges-Anglas wrote:
> Thanks folks for your replies,
> 
> On Mon, Jul 07, 2025 at 12:26:21PM +0200, Mark Kettenis wrote:
> > > Date: Mon, 7 Jul 2025 08:17:37 +0200
> > > From: Martin Pieuchot <m...@grenadille.net>
> > > 
> > > On 06/07/25(Sun) 21:15, Jeremie Courreges-Anglas wrote:
> > > > On Tue, Jul 01, 2025 at 06:18:37PM +0200, Jeremie Courreges-Anglas wrote:
> > > > > On Tue, Jun 24, 2025 at 05:21:56PM +0200, Jeremie Courreges-Anglas wrote:
> > > > > > 
> > > > > > I think it's uvm_purge(), as far as I can see it happens when building
> > > > > > rust with cvs up -D2025/06/04 in /sys, not with -D2025/06/03.  Maybe I
> > > > > > missed lang/rust when testing the diff.
> > > > > > 
> > > > > > This is with additional MP_LOCKDEBUG support for mutexes, and
> > > > > > __mp_lock_spinout = 50L * INT_MAX.
> > > > > > 
> > > > > > Suggested by claudio: tr /t 0t269515 fails.
> > > > > 
> > > > > Should be fixed, at least for CPUs that ddb actually managed to stop...
> > > > > 
> > > > > > WITNESS doesn't flag an obvious lock ordering issue.  I'm not even
> > > > > > sure there is one.  It also happens with CPU_MAX_BUSY_CYCLES == 64.
> > > > > > 
> > > > > > Maybe we're still hammering the locks too much?  Input and ideas to
> > > > > > test welcome.  Right now I'm running with just uvm_purge() reverted.
> > > > > 
> > > > > Reverting uvm_purge() did not help.  I've been able to reproduce the
> > > > > hangs up to cvs up -D2025/05/18, backporting the arm64 ddb trace fix,
> > > > > mpi's mutex backoff diff and the mtx_enter MP_LOCKDEBUG diff.
> > > > > Currently trying to reproduce with cvs up -D2025/05/07 as suggested
> > > > > by dlg.
> > > > 
> > > > cvs up -D2025/05/07 also did not help.  -current still locks up, but
> > > > the db_mutex fix helps get proper stacktraces:
> > > 
> > > Thanks for the report.
> > > 
> > > > --8<--
> > > > login: mtf_fnxe0:00nf2ffmrte8 0m21xt2 8xlef_ cf2nktft esl8r_u0f k00t
> > > > sxor2f0
> > > > 0tf
> 8oeff
> > > > ftff
> efStopped at      mtx_enter+0x13c:        ldr     x26, [x25,#2376]
> > > 
> > > What is your value of __mp_lock_spinout?
> 
> 50L * INT_MAX
> 
> > > From the ddb traces I understand that the `sched_lock' mutex is
> > > contended.  It's not clear to me why, but I believe that's because rustc
> > > uses sched_yield(2) in a loop.  Could you figure out where this syscall
> > > is coming from?
> 
> Frankly, I don't know if I want to try and figure it out. ^^"
I understand.  However this is IMHO the crucial piece of work that has to
be done.  Alternatively, one could replace all _spinlock() uses in libc
with a primitive that does not issue a syscall in a loop.

> > > I'm upset that sched_yield() is still used and causing trouble.  Now
> > > that `sched_lock' is a mutex without guarantee of progress it is easy
> > > to hang the machine by calling it in a loop.  A proper solution would
> > > be to stop using sched_yield(2).  This will bite us as long as it is
> > > here.
> 
> Regarding its uses, there's one in librthread/rthread_sched.c which
> looks legitimate, and also another in libc that caught my eye recently:
> 
> /*
>  * internal support functions
>  */
> void
> _spinlock(volatile _atomic_lock_t *lock)
> {
> 	while (_atomic_lock(lock))
> 		sched_yield();
> 	membar_enter_after_atomic();
> }
> DEF_STRONG(_spinlock);
> 
> That one looks a bit heavy-handed...

It is certainly one of these two; the question is which one.

> Now, if the problem really is that our mutexes aren't fair, then isn't
> any mutex (incl the sched lock) usable for such hammering from
> userland?

Hammering is the problem here.  In the new ddb trace we see 4 rustc
threads in the middle of sched_yield(2)...  This is really an example of
how not to do multi-threading...

The one on CPU6 is in the middle of mi_switch() and doesn't appear in
ps /o.  It holds the `sched_lock' and there's no guarantee that CPU0,
which holds the KERNEL_LOCK() and wants to do wakeup(9), will grab it.

> > Hmm, well, mutexes might not be fair, but they should guarantee
> > forward progress.  And the LL/SC primitives on arm64 do guarantee
> > forward progress, but there are certain conditions.  And I think our
> > current MI mutex implementation violates those conditions.  And
> > WITNESS and MP_LOCKDEBUG might actually be making things worse.
> 
> I thought LL/SC guarantees depended on the length of the section
> between the load and store, but...  I usually fail to find what I'm
> looking for when looking at ARM docs.
> 
> > Now on the M2 we should be using CAS instead of LL/SC.
> 
> Just to rule out an obvious issue:
> 
> ddb{0}> x arm64_has_lse
> arm64_has_lse:          1
> 
> > The architecture reference manual isn't explicit about forward
> > guarantees for that instruction.  So I need to dig a bit deeper into
> > this.  I doubt there is no forward progress guarantee for those
> > instructions, but maybe there are similar conditions on how these get
> > used.
> > 
> > Will dig deeper and see if we can fix the mutex implementation on
> > arm64.
> 
> ack, thanks

I'm not sure the issue is with the primitive.

> > > A workaround would be to use a backoff mechanism inside the loop.
> > > 
> > > Another workaround could be to never spin on the `sched_lock' in
> > > sys_sched_yield() and instead sleep.
> > > 
> > > Because of such contention all wakeup(9)s inside the pmemrange
> > > allocator add contention on the `fpageqlock' which is what is
> > > "hanging" your machine.
> > > 
> > > Diff below is another workaround that might help.
> 
> Alas it doesn't seem to help enough to avoid hangs.  Here's another
> fresh report using your diff.

It seems to help because the contention is now on the kernel lock and no
longer on the global pmemrange mutex.
> --8<--
> login: __mp_lock_spin: 0xffffff8001326508 lock spun out
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{4}> mach cpuinfo
>    0: stopped
>    1: stopped
>    2: stopped
>    3: stopped
> *  4: ddb
>    5: stopped
>    6: stopped
>    7: stopped
>    8: stopped
>    9: stopped
> ddb{4}> mach ddb 0
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{0}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at __aarch64_cas8_acq_rel+0x1c
> __aarch64_cas8_acq_rel() at wakeup_n+0x44
> wakeup_n() at task_add+0x6c
> task_add() at ifiq_input+0x1f4
> ifiq_input() at bge_rxeof+0x32c
> bge_rxeof() at bge_intr+0x174
> bge_intr() at aplintc_irq_handler+0x170
> aplintc_irq_handler() at arm_cpu_irq+0x44
> arm_cpu_irq() at handle_el1h_irq+0x68
> handle_el1h_irq() at mtx_enter+0x104
> mtx_enter() at endtsleep+0x30
> endtsleep() at timeout_run+0xb4
> timeout_run() at softclock_process_tick_timeout+0xc4
> softclock_process_tick_timeout() at softclock+0xe8
> softclock() at softintr_dispatch+0xc8
> softintr_dispatch() at arm_do_pending_intr+0xf4
> arm_do_pending_intr() at mi_switch+0x1b0
> mi_switch() at ast+0x98
> ast() at handle_el0_sync+0x9c
> handle_el0_sync() at 0x19ae4be318
> --- trap ---
> end of kernel
> ddb{0}> mach ddb 1
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{1}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at __mp_lock+0x104
> __mp_lock() at __mp_acquire_count+0x34
> __mp_acquire_count() at uiomove+0x12c
> uiomove() at ffs_read+0x198
> ffs_read() at VOP_READ+0x3c
> VOP_READ() at vn_rdwr+0xb0
> vn_rdwr() at vmcmd_map_readvn+0x8c
> vmcmd_map_readvn() at exec_process_vmcmds+0x78
> exec_process_vmcmds() at sys_execve+0x620
> sys_execve() at svc_handler+0x430
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x13812aa6d8
> --- trap ---
> end of kernel
> ddb{1}> mach ddb 2
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{2}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at __mp_lock+0x104
> __mp_lock() at doopenat+0x160
> doopenat() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x4521566bc
> --- trap ---
> end of kernel
> ddb{2}> mach ddb 3
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{3}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at __aarch64_cas8_acq_rel+0x1c
> __aarch64_cas8_acq_rel() at sys_sched_yield+0x68
> sys_sched_yield() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x19ae4b2fdc
> --- trap ---
> end of kernel
> ddb{3}> mach ddb 4
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{4}> tr
> db_enter() at __mp_lock+0x128
> __mp_lock() at doopenat+0x160
> doopenat() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x43e0b76bc
> --- trap ---
> end of kernel
> ddb{4}> mach ddb 5
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{5}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at __mp_lock+0xf8
> __mp_lock() at vn_statfile+0x2c
> vn_statfile() at sys_fstat+0x84
> sys_fstat() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x43dcf999c
> --- trap ---
> end of kernel
> ddb{5}> mach ddb 6
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{6}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at agtimer_get_timecount_default+0x18
> agtimer_get_timecount_default() at nanouptime+0x44
> nanouptime() at tuagg_add_runtime+0x40
> tuagg_add_runtime() at mi_switch+0x90
> mi_switch() at sys_sched_yield+0x88
> sys_sched_yield() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x19ae4b2fdc
> --- trap ---
> end of kernel
> ddb{6}> mach ddb 7
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{7}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at __mp_lock+0xf8
> __mp_lock() at sys___realpath+0x154
> sys___realpath() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x4af9eacc8
> --- trap ---
> end of kernel
> ddb{7}> mach ddb 8
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{8}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at aplintc_splx+0x1f0
> aplintc_splx() at mtx_enter+0xe8
> mtx_enter() at sys_sched_yield+0x68
> sys_sched_yield() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x19ae4b2fdc
> --- trap ---
> end of kernel
> ddb{8}> mach ddb 9
> Stopped at      db_enter+0x18:  brk     #0x0
> ddb{9}> tr
> db_enter() at aplintc_fiq_handler+0x6c
> aplintc_fiq_handler() at arm_cpu_fiq+0x44
> arm_cpu_fiq() at handle_el1h_fiq+0x68
> handle_el1h_fiq() at mtx_enter+0x10c
> mtx_enter() at sys_sched_yield+0x68
> sys_sched_yield() at svc_handler+0x450
> svc_handler() at do_el0_sync+0x1cc
> do_el0_sync() at handle_el0_sync+0x70
> handle_el0_sync() at 0x19ae4b2fdc
> --- trap ---
> end of kernel
> ddb{9}> show all locks
> CPU 0:
> shared mutex timeout r = 0 (0xffffff8001202d88)
> CPU 6:
> exclusive mutex &sched_lock r = 0 (0xffffff8001272148)
> Process 16821 (ninja) thread 0xffffff8113e80068 (92806)
> exclusive rrwlock inode r = 0 (0xffffff8055608760)
> Process 40072 (rustc) thread 0xffffff8113e814a8 (446746)
> exclusive rrwlock inode r = 0 (0xffffff81049e8538)
> Process 23063 (rustc) thread 0xffffff8113e819b8 (12581)
> exclusive kernel_lock &kernel_lock r = 0 (0xffffff8001326d10)
> ddb{9}> show witness /b
> Number of known direct relationships is 443
> 
> Lock order reversal between "&mp->mnt_lock"(rwlock) and
> "&ip->i_lock"(rrwlock)!
> 
> witness: incomplete path, depth 8
> ddb{9}> ps /o
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>   92806  16821     55      0x1005          0    1  ninja
>  285104    167     55         0x3          0    2  c++
>  202504  45511     55         0x3          0    7  c++
>  288505  96331     55         0x3          0    4  c++
>  134651  73257     55         0x3          0    5  c++
>  230543  23063     55   0x2000003  0x4000000    3  rustc
>   12581  23063     55   0x2000003  0x4000000    0K rustc
>    4934  23063     55   0x2000003  0x4000000    8  rustc
> * 89729  23063     55   0x2000003  0x4000000    9  rustc
> ddb{9}> ps
>    PID     TID   PPID    UID  S       FLAGS  WAIT      COMMAND
>  16821   92806  23175     55  7      0x1005            ninja
>    167  285104  63255     55  7         0x3            c++
>  63255  492627  23175     55  3    0x10008b  sigsusp   sh
>  40072   77074  38114     55  3   0x2000083  fsleep    rustc
>  40072  100553  38114     55  3   0x6000083  piperd    rustc
>  40072  446746  38114     55  2   0x6000003            rustc
>  45511  202504   4053     55  7         0x3            c++
>   4053   80003  23175     55  3    0x10008b  sigsusp   sh
>  96331  288505  78782     55  7         0x3            c++
>  78782   47058  23175     55  3    0x10008b  sigsusp   sh
>  73257  134651  99268     55  7         0x3            c++
>  99268  333490  23175     55  3    0x10008b  sigsusp   sh
>  23063  335091  38114     55  3   0x2000083  fsleep    rustc
>  23063  363979  38114     55  3   0x6000083  piperd    rustc
>  23063  428745  38114     55  3   0x6000083  fsleep    rustc
>  23063  279995  38114     55  3   0x6000083  fsleep    rustc
>  23063  181877  38114     55  3   0x6000083  fsleep    rustc
>  23063  230543  38114     55  7   0x6000003            rustc
>  23063  216080  38114     55  2   0x6000003            rustc
>  23063   12581  38114     55  7   0x6000003            rustc
>  23063    4934  38114     55  7   0x6000003            rustc
> *23063   89729  38114     55  7   0x6000003            rustc
>  38114  473886  68745     55  3   0x2000083  fsleep    cargo
>  38114  327225  68745     55  3   0x6000083  fsleep    cargo
>  38114  207803  68745     55  3   0x6000083  kqread    cargo
>  38114  442514  68745     55  3   0x6000083  kqread    cargo
>  68745   91160  66756     55  3   0x2200083  wait      python3.12
>  66756  145557  20253     55  3    0x10008b  sigsusp   make
>  23175  305437  95548     55  3        0x8b  kqread    ninja
>  95548  326376  95211     55  3        0x83  kqread    cmake
>  95211  452507  60695     55  3    0x10008b  sigsusp   make
>  20253  453028  74977   1000  3    0x10008b  sigsusp   make
>  74977  118496  85512   1000  3    0x10008b  sigsusp   sh
>  60695  512137  28928   1000  3    0x10008b  sigsusp   make
>  28928  132313  46768   1000  3    0x10008b  sigsusp   sh
>  46768  510404  31892   1000  3    0x10008b  sigsusp   make
>  85512  168761   5112   1000  3    0x10008b  sigsusp   make
>  31892   41853  33691   1000  3    0x10008b  sigsusp   ksh
>   5112   45762  33691   1000  3    0x10008b  sigsusp   ksh
>  33691  339742      1   1000  3    0x100080  kqread    tmux
>  50447  258378  72048   1000  3    0x100083  kqread    tmux
>  52948  303715      1   1000  3    0x1000b8  kqread    ssh-agent
>  72048  282038  80004   1000  3    0x10008b  sigsusp   ksh
>  80004  478023  75098   1000  3        0x98  kqread    sshd-session
>  75098   88401  26790      0  3        0x92  kqread    sshd-session
>  16285  372990      1      0  3    0x100083  ttyin     getty
>  88950  327084      1      0  3    0x100083  ttyin     getty
>  32299  412662      1      0  3    0x100083  ttyin     getty
>   9624  388470      1      0  3    0x100083  ttyin     getty
>  95942  199073      1      0  3    0x100083  ttyin     getty
>  13386  235018      1      0  3    0x100083  ttyin     getty
>  71288  139796      1      0  3    0x100098  kqread    cron
>  58489  449220      1     99  3   0x1100090  kqread    sndiod
>  88897  458345      1    110  3    0x100090  kqread    sndiod
>  21307  145191  57732     95  3   0x1100092  kqread    smtpd
>  85854  126275  57732    103  3   0x1100092  kqread    smtpd
>  62845  395981  57732     95  3   0x1100092  kqread    smtpd
>  68484  261396  57732     95  3    0x100092  kqread    smtpd
>  96401  408083  57732     95  3   0x1100092  kqread    smtpd
>  83314   38871  57732     95  3   0x1100092  kqread    smtpd
>  57732  508617      1      0  3    0x100080  kqread    smtpd
>  26790  121973      1      0  3        0x88  kqread    sshd
>  29643  367483  12647      0  3        0x80  nfsd      nfsd
>  94049  228378  12647      0  3        0x80  nfsd      nfsd
>   6946   26461  12647      0  3        0x80  nfsd      nfsd
>  27528  512446  12647      0  3        0x80  nfsd      nfsd
>  12647  261198      1      0  3        0x80  netacc    nfsd
>  40776  365439  11876      0  3        0x80  kqread    mountd
>  11876  241363      1      0  3    0x100080  kqread    mountd
>  56481   14460      1     28  3   0x1100090  kqread    portmap
>  59772  105103  17556     83  3    0x100092  kqread    ntpd
>  17556   51510  78827     83  3   0x1100092  kqread    ntpd
>  78827  213143      1      0  3    0x100080  kqread    ntpd
>  58198   68980  84345     74  3   0x1100092  bpf       pflogd
>  84345  315647      1      0  3        0x80  sbwait    pflogd
>   4462  325006  88024     73  3   0x1100090  kqread    syslogd
>  88024  393773      1      0  3    0x100082  sbwait    syslogd
>  20177  177437      1      0  3    0x100080  kqread    resolvd
>   1916  512556  22600     77  3    0x100092  kqread    dhcpleased
>  89301  356030  22600     77  3    0x100092  kqread    dhcpleased
>  22600  388065      1      0  3        0x80  kqread    dhcpleased
>   9674   57681  39128    115  3    0x100092  kqread    slaacd
>  99969  364740  39128    115  3    0x100092  kqread    slaacd
>  39128  377692      1      0  3    0x100080  kqread    slaacd
>   7488  127254      0      0  3     0x14200  bored     wsdisplay0
>  80941  354199      0      0  3     0x14200  bored     apple-dcp-afkep%02x
>  83845  251298      0      0  3     0x14200  bored     apple-dcp-afkep%02x
>  48577  323255      0      0  3     0x14200  bored     apple-dcp-afkep%02x
>  26744  433388      0      0  3     0x14200  bored     drmrtk
>  23659  215894      0      0  3     0x14200  bored     smr
>  62145  444779      0      0  3     0x14200  pgzero    zerothread
>  60948   51173      0      0  3     0x14200  aiodoned  aiodoned
>  61934  416352      0      0  3     0x14200  syncer    update
>  60595  230019      0      0  3     0x14200  cleaner   cleaner
>  59553  366424      0      0  3     0x14200  reaper    reaper
>  57148   36768      0      0  3     0x14200  pgdaemon  pagedaemon
>   4700  330347      0      0  3     0x14200  usbtsk    usbtask
>  37673  384731      0      0  3     0x14200  usbatsk   usbatsk
>  71799  233642      0      0  3     0x14200  bored     drmtskl
>  72815  226502      0      0  3     0x14200  bored     drmlwq
>  22325  377620      0      0  3     0x14200  bored     drmlwq
>  58569  438671      0      0  3     0x14200  bored     drmlwq
>  35034  511019      0      0  3     0x14200  bored     drmlwq
>  51245  343392      0      0  3     0x14200  bored     drmubwq
>   5539  210307      0      0  3     0x14200  bored     drmubwq
>  96204  103587      0      0  3     0x14200  bored     drmubwq
>  99133  343617      0      0  3     0x14200  bored     drmubwq
>  62111  109225      0      0  3     0x14200  bored     drmhpwq
>  54844  103964      0      0  3     0x14200  bored     drmhpwq
>  98273    6361      0      0  3     0x14200  bored     drmhpwq
>  69125  196587      0      0  3     0x14200  bored     drmhpwq
>  36820  327811      0      0  3     0x14200  bored     drmwq
>    685  197752      0      0  3     0x14200  bored     drmwq
>  12733  475339      0      0  3     0x14200  bored     drmwq
>  50628   50813      0      0  3     0x14200  bored     drmwq
>  38742  421411      0      0  3     0x14200  bored     sensors
>  15252  489393      0      0  3     0x14200  bored     sleep
>  45280  351306      0      0  3  0x40014200            idle9
>  18476   19261      0      0  3  0x40014200            idle8
>  73955  129108      0      0  3  0x40014200            idle7
>  40102  188571      0      0  3  0x40014200            idle6
>  68869   24066      0      0  3  0x40014200            idle5
>  16358  379475      0      0  3  0x40014200            idle4
>  28248  237855      0      0  3  0x40014200            idle3
>  14101  490124      0      0  3  0x40014200            idle2
>  48008  167892      0      0  3  0x40014200            idle1
>  44662  441379      0      0  3     0x14200  bored     softnet3
>  18453  428842      0      0  3     0x14200  bored     softnet2
>  49989   83580      0      0  3     0x14200  bored     softnet1
>  93898  345769      0      0  3     0x14200  bored     softnet0
>  64190  206376      0      0  3     0x14200  bored     systqmp
>  95969  416641      0      0  3     0x14200  bored     systq
>  38042  457408      0      0  3     0x14200  tmoslp    softclockmp
>  57671  339967      0      0  3  0x40014200  tmoslp    softclock
>  32509  359030      0      0  3  0x40014200            idle0
>  99051   67429      0      0  3     0x14200  kmalloc   kmthread
>      1   12407      0      0  3        0x82  wait      init
>      0       0     -1      0  3     0x10200  scheduler swapper
> ddb{9}> 
> -->8--
> 
> -- 
> jca