Hello bugs@, Martin,

For a while now I have been noticing processes hanging on my Samsung Galaxy Book4 Edge (arm64/snapdragon-x/12-cores/16gb ram) machine. These hangs occur very frequently, which makes it hard to work on the machine, since things like xterm, ssh, man, etc. just suddenly start to hang. When this happens, executing another process immediately releases the hanging/waiting process.

I've discussed this behavior today on icb, which led to the following conversation:

11:39 < mglocker>  5344 hacki -18 0 1436K 392K idle flt_pmf 0:00 0.00% man
11:41 < mglocker> uvm_wait("flt_pmfail1");
11:42 < mglocker> uvm_wait("flt_pmfail2");
11:43 < mglocker> 49811 hacki -18 0 8144K 112K sleep/0 flt_pmf 0:00 0.00% xterm
11:54 < mglocker> ok, the process hang is always at uvm/uvm_fault.c:1879 -> uvm_wait("flt_pmfail2")
12:17 < kettenis> so that's pmap_enter() failing
12:19 < kettenis> which means a pool allocation failure
12:20 < kettenis> what does vmstat -m say about the "pted" and "vp" pools?
12:28 < mglocker> Name Size Requests Fail InUse Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
12:29 < mglocker> pted   40   962117    0 42480  1582     0  1582  1582     1     8    0
12:29 < mglocker> vp   8192    47009  102  5676  7830  1100  6730  7830    20     8   20
12:30 < mglocker> vp 102 fails?
12:37 < mglocker> it keeps increasing on those hangs
12:46 < mglocker> so pmap_enter_vp() fails for
12:46 < mglocker> vp2 = pool_get()
12:46 < mglocker> and
12:47 < mglocker> vp3 = pool_get()
13:00 < mglocker> i booted again with a fresh single processor kernel. there no vp fails.
13:09 < claudio> didn't we switch the vp pool to use per-cpu caches exactly because of this?
14:02 < kettenis> I believe so
14:03 < kettenis> the problem is that pmap_enter(9) isn't supposed to sleep
14:03 < kettenis> so the pool allocations are done with PR_NOWAIT
14:04 < kettenis> but that means that kd_trylock gets set
14:04 < kettenis> which means that the allocations fail if there is contention on the pool lock
14:04 < claudio> yes, I remeber this strange behaviour.
14:06 < kettenis> uvm things this means we're out of physmem
14:06 < kettenis> so it'll sleep until something else pokes the pagedaemon
14:06 < kettenis> the per-cpu mitigated the issue somewhat
14:07 < kettenis> but didn't solve things completely
14:07 < kettenis> and now that mpi pushed back the locks in uvm again, the problem is back
14:09 < kettenis> so we need a real solution for this problem...
14:12 < kettenis> a potential solution would be to make pmap_enter(9) return a different error for this case
14:13 < kettenis> and then handle that case different in uvm_fault_{upper|lower}
14:15 < kettenis> the problem there is that pool_get() doesn't actually tell us why it failed
14:37 < kettenis> s/contention on the pool lock/contention on the kernal map/

Any proposal on how we could proceed to find a solution for this issue?

Cheers,
Marcus
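
P.S. To make kettenis' idea of a distinct error code a bit more concrete, here is a minimal userland toy model in C. It is not kernel code and not a proposed diff: every toy_* name, the choice of EAGAIN as the "contention" error, and the retry limit are assumptions made purely for illustration. As noted in the conversation, pool_get() does not currently report why it failed, so the toy allocator below simply pretends that it can.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical: reason a non-sleeping allocation did or did not succeed. */
enum toy_alloc_status { TOY_ALLOC_OK, TOY_ALLOC_NOMEM, TOY_ALLOC_BUSY };

/* Toy stand-in for the state of the "vp" pool. */
struct toy_pool {
	int free_items;		/* items that can still be handed out */
	int busy_rounds;	/* upcoming attempts that will hit contention */
};

/* Toy PR_NOWAIT-style allocation that also says why it failed. */
static enum toy_alloc_status
toy_pool_get_nowait(struct toy_pool *pp)
{
	if (pp->busy_rounds > 0) {
		pp->busy_rounds--;
		return TOY_ALLOC_BUSY;	/* trylock lost, memory may be fine */
	}
	if (pp->free_items == 0)
		return TOY_ALLOC_NOMEM;	/* genuinely out of memory */
	pp->free_items--;
	return TOY_ALLOC_OK;
}

/* Toy pmap_enter(): ENOMEM for a real shortage, EAGAIN for contention. */
static int
toy_pmap_enter(struct toy_pool *pp)
{
	switch (toy_pool_get_nowait(pp)) {
	case TOY_ALLOC_OK:
		return 0;
	case TOY_ALLOC_BUSY:
		return EAGAIN;
	default:
		return ENOMEM;
	}
}

enum toy_fault_action { TOY_FAULT_DONE, TOY_FAULT_SLEEP };

/*
 * Toy fault handler: retry on contention, and only fall back to the
 * sleep (the uvm_wait("flt_pmfail2") case) on a real shortage or once
 * the retry budget is used up.
 */
static enum toy_fault_action
toy_fault(struct toy_pool *pp, int max_retries, int *retries)
{
	int error;

	assert(max_retries >= 0);
	*retries = 0;
	for (;;) {
		error = toy_pmap_enter(pp);
		if (error == 0)
			return TOY_FAULT_DONE;
		if (error == EAGAIN && *retries < max_retries) {
			(*retries)++;
			continue;
		}
		return TOY_FAULT_SLEEP;
	}
}
```

In this model a fault that only lost the trylock is retried and completes without waiting for the pagedaemon, while a genuine shortage still ends up sleeping. Whether retrying, yielding, or something else entirely is the right reaction in uvm_fault_{upper|lower} is exactly the open question.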