Hello bugs@, Martin,

For a while now I have been noticing processes hanging on my Samsung
Galaxy Book4 Edge (arm64/snapdragon-x/12-cores/16gb ram) machine.  These
hangs occur very frequently, which makes it hard to work on the machine,
since things like xterm, ssh, man, etc. just suddenly start to hang.
When this happens, executing another process immediately releases the
hanging/waiting process.

I discussed this behavior today on icb, which led to the following
conversation:

11:39 < mglocker> 5344 hacki    -18    0 1436K  392K idle      flt_pmf   0:00  0.00% man
11:41 < mglocker> uvm_wait("flt_pmfail1");
11:42 < mglocker> uvm_wait("flt_pmfail2");
11:43 < mglocker> 49811 hacki    -18    0 8144K  112K sleep/0   flt_pmf   0:00  0.00% xterm
11:54 < mglocker> ok, the process hang is always at uvm/uvm_fault.c:1879 -> uvm_wait("flt_pmfail2")

12:17 < kettenis> so that's pmap_enter() failing
12:19 < kettenis> which means a pool allocation failure
12:20 < kettenis> what does vmstat -m say about the "pted" and "vp" pools?
12:28 < mglocker> Name        Size Requests Fail    InUse Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
12:29 < mglocker> pted          40   962117    0    42480  1582     0  1582  1582     1     8    0
12:29 < mglocker> vp          8192    47009  102     5676  7830  1100  6730  7830    20     8   20
12:30 < mglocker> vp 102 fails?
12:37 < mglocker> it keeps increasing on those hangs
12:46 < mglocker> so pmap_enter_vp() fails for
12:46 < mglocker> vp2 = pool_get()
12:46 < mglocker> and
12:47 < mglocker> vp3 = pool_get()
13:00 < mglocker> i booted again with a fresh single processor kernel.  there are no vp fails.
13:09 < claudio> didn't we switch the vp pool to use per-cpu caches exactly because of this?
14:02 < kettenis> I believe so
14:03 < kettenis> the problem is that pmap_enter(9) isn't supposed to sleep
14:03 < kettenis> so the pool allocations are done with PR_NOWAIT
14:04 < kettenis> but that means that kd_trylock gets set
14:04 < kettenis> which means that the allocations fail if there is contention on the pool lock
14:04 < claudio> yes, I remember this strange behaviour.
14:06 < kettenis> uvm thinks this means we're out of physmem
14:06 < kettenis> so it'll sleep until something else pokes the pagedaemon
14:06 < kettenis> the per-cpu mitigated the issue somewhat
14:07 < kettenis> but didn't solve things completely
14:07 < kettenis> and now that mpi pushed back the locks in uvm again, the problem is back
14:09 < kettenis> so we need a real solution for this problem...
14:12 < kettenis> a potential solution would be to make pmap_enter(9) return a different error for this case
14:13 < kettenis> and then handle that case differently in uvm_fault_{upper|lower}
14:15 < kettenis> the problem there is that pool_get() doesn't actually tell us why it failed
14:37 < kettenis> s/contention on the pool lock/contention on the kernel map/
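
To make the idea from the conversation above a bit more concrete, here is
a minimal userspace sketch in C.  It is a toy model and not OpenBSD code:
every toy_* name is made up for illustration, the "lock_held" flag only
stands in for the contention case kettenis describes, and using EAGAIN as
the "different error" is purely an assumption on my side.  It only shows
why folding both failure reasons into one error sends the fault handler to
sleep, and how a distinct error could allow a plain retry instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pool; NOT the real struct pool. */
struct toy_pool {
	bool	lock_held;	/* models contention seen by a trylock */
	size_t	free_pages;	/* pages the pool could still hand out */
};

/* Why a non-sleeping allocation did or did not succeed. */
enum toy_status { TOY_OK, TOY_NOMEM, TOY_BUSY };

/* What the (toy) fault handler decides to do next. */
enum toy_action { TOY_DONE, TOY_RETRY, TOY_WAIT_PAGEDAEMON };

/* Non-sleeping allocation: give up instead of waiting for the lock. */
static enum toy_status
toy_pool_get_nowait(struct toy_pool *pp)
{
	if (pp->lock_held)
		return TOY_BUSY;	/* contention, not a memory shortage */
	if (pp->free_pages == 0)
		return TOY_NOMEM;	/* genuinely nothing left */
	pp->free_pages--;
	return TOY_OK;
}

/* Behaviour as described in the log: every failure looks like ENOMEM. */
static int
toy_pmap_enter_current(struct toy_pool *pp)
{
	return toy_pool_get_nowait(pp) == TOY_OK ? 0 : ENOMEM;
}

/* Hypothetical variant: report contention with a separate error. */
static int
toy_pmap_enter_proposed(struct toy_pool *pp)
{
	switch (toy_pool_get_nowait(pp)) {
	case TOY_OK:
		return 0;
	case TOY_BUSY:
		return EAGAIN;	/* assumed error code, for illustration */
	default:
		return ENOMEM;
	}
}

/* Toy fault handler: only a real shortage waits for the pagedaemon. */
static enum toy_action
toy_fault_handle(int error)
{
	if (error == 0)
		return TOY_DONE;
	if (error == EAGAIN)
		return TOY_RETRY;
	return TOY_WAIT_PAGEDAEMON;
}
```

With a contended pool that still has free pages, the "current" path ends
up in TOY_WAIT_PAGEDAEMON (the hang I am seeing), while the "proposed"
path ends up in TOY_RETRY.  As noted in the log, the open question is that
pool_get() does not report why it failed, so the real code has no
equivalent of TOY_BUSY to look at today.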

Any proposal on how we could proceed to find a solution for this issue?

Cheers,
Marcus