Stephen Borrill <net...@precedence.co.uk> writes:

> On Sat, 4 Nov 2023, Simon Burge wrote:
>> Greg Troxel wrote:
>>
>>> So to me this feels like a locking botch in a rare path in zfs.
>>
>> This appears to be the case.  Chuck Silvers has some understanding of
>> the problem and I'm helping test, but at this stage there isn't a fix
>> available. :/
>
> It's interesting that you see the lockups during pkgsrc builds, i.e. a
> period where there is lots of file creation. We use zfs on backup
> systems that pull in data with rsync. During the initial runs (where
> every file is new) we usually get a couple of lockups, but during day
> to day operation (few changes) it is reliable. These are on physical
> and virtual machines running NetBSD 9 with the rule of thumb of 1GB
> RAM per TB of storage obeyed, but no patches besides setting MAXPHYS
> in the module to 32k for Xen.

I just had another problem, on the non-xen 32GB machine (which has 3.5T
in the pool, only half full). The machine wasn't really doing much: X
running with xfce, a few xterms, an ssh client, pidgin, and an idling
firefox with I think 24 tabs.

I found it mostly normal and was using an ssh session, and then switched
to the firefox virtual desktop, which failed to redraw. I tried to kill
firefox (because firefox hanging is not so strange :-() and found that a
few of the tabs appeared to be stuck in flt_noram and zio_buf. I think
there might have been a different wchan earlier that was zfs but not
zio_buf.

I think it got in this state due to firefox leaking memory (in SIZE but
not RES?). (So it might be a missing wakeup on flt_noram, but a lock not
being released seems plausible also. Totally guessing here.)

(As I was composing this message (in tmux on another machine), the
firefox lockup deteriorated to more things hanging and then a total
lockup. I was unable to ctrl-alt-f1 to get back to the text console. It
is responding to mdns queries and pings, and sshd answers, but I see
"local version string" and the "remote protocol" line does not appear.)

I should try LOCKDEBUG on the package building box (where if it doesn't
work right that's much more ok!).

10853   129  9817    0  85  0  2762264 180704 uvnfp2   DEl ttyp5   7:12.20 (firefox)
10853   994  9817    0   0  0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853  1867  9817    0  85  0  3423848 723944 uvnfp2   DEl ttyp5 146:52.32 (firefox)
10853  7407  9817    0  85  0 20169184 355160 flt_nora DEl ttyp5  63:48.76 (firefox)
10853  7630  9817    0   0  0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853  8451  9817    0  85  0  2712376 126588 uvnfp2   DEl ttyp5   7:09.93 (firefox)
10853  8504  9817    0  85  0  2744608 143008 uvnfp2   DEl ttyp5   9:56.45 (firefox)
10853  9817     1   21 117  0 12939188 948252 zio_buf_ DEl ttyp5 303:41.53 (firefox)
10853 11066  9817    0  85  0  2821832 225664 &db->db_ DEl ttyp5   1:34.01 (firefox)
10853 11769  9817    0  85  0  2849780 232172 uvnfp2   DEl ttyp5   9:19.27 (firefox)
10853 12055  9817    0  85  0  2832852 144304 uvnfp2   DEl ttyp5   8:49.22 (firefox)
10853 13075  9817    0  85  0  2782516 193652 plpg     DEl ttyp5   9:00.21 (firefox)
10853 15399  9817    0  85  0  2822236 249496 uvnfp2   DEl ttyp5  10:12.41 (firefox)
10853 15991  9817    0  85  0  2775316 187104 uvnfp2   DEl ttyp5   7:13.63 (firefox)
10853 16033  9817    0   0  0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 16877  9817    0  85  0  2731156 148896 uvnfp2   DEl ttyp5   1:59.22 (firefox)
10853 17275  9817    0   0  0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 19768  9817    0  85  0  2760188 152880 uvnfp2   DEl ttyp5   7:11.17 (firefox)
10853 21618  9817    0   0  0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 24342  9817    0  85  0  2737588 148452 uvnfp2   DEl ttyp5  11:51.61 (firefox)
10853 24956  9817    0  85  0  2981764 336852 uvnfp2   DEl ttyp5  20:20.13 (firefox)
10853 26368  9817    0  85  0  3164560 240992 uvnfp2   DEl ttyp5  19:28.72 (firefox)
10853 26981  9817 1123  85  0  3659088 770432 flt_nora DEl ttyp5  84:09.22 (firefox)
10853 27139  9817    0   0  0        0      0 -        Z   ttyp5   0:00.00 (firefox)
10853 29076  9817 2270  85  0  2975552 261064 flt_nora DEl ttyp5  88:44.15 (firefox)

top says

Memory: 14G Act, 6989M Inact, 88M Wired, 549M Exec, 13G File, 228M Free
Swap: 40G Total, 348M Used, 40G Free / Pools: 11G Used

so it did get into paging

vmstat -s:

4096 bytes per page
16 page colors
8079588 pages managed
58419 pages free
3733123 pages active
1789074 pages inactive
1 pages paging
22427 pages wired
1 reserve pagedaemon pages
40 reserve kernel pages
252503 boot kernel pages
2817259 kernel pool pages
2027769 anonymous pages
3376469 cached file pages
140490 cached executable pages
2048 minimum free pages
2730 target free pages
2693196 maximum wired pages
1 swap devices
10485759 swap pages
89099 swap pages in use
534237 swap allocations
6412584353 total faults taken
6413831611 traps
78980016 device interrupts
1337982749 CPU context switches
312160043 software interrupts
11693270083 system calls
519672 pagein requests
84753 pageout requests
0 pages swapped in
1355436 pages swapped out
11042006 forks total
7107987 forks blocked parent
7107987 forks shared address space with parent
5114776798 pagealloc desired color avail
328102950 pagealloc desired color not avail
4442734240 pagealloc local cpu avail
1000145508 pagealloc local cpu not avail
15518 faults with no memory
0 faults with no anons
1780 faults had to wait on pages
0 faults found released page
18115368 faults relock (18042145 ok)
301349249 anon page faults
510693 anon retry faults
588841483 amap copy faults
238859694 neighbour anon page faults
6051925265 neighbour object page faults
1842970725 locked pager get faults
17593932 unlocked pager get faults
193768531 anon faults
107569755 anon copy on write faults
1381042165 object faults
461851046 promote copy faults
4244253805 promote zero fill faults
4295634 faults upgraded lock
15025 faults couldn't upgrade lock
5042 times daemon wokeup
2631 revolutions of the clock hand
29898211 pages freed by daemon
45830406 pages scanned by daemon
992050 anonymous pages scanned by daemon
28906451 object pages scanned by daemon
321731 pages reactivated
0 pages found busy by daemon
1270683 total pending pageouts
49696575 pages deactivated
272335878 per-cpu stats synced
686222 anon pages possibly dirty
1341547 anon pages dirty
0 anon pages clean
70660 file pages possibly dirty
56 file pages dirty
3446243 file pages clean
20979779248 total name lookups
19150722941 good hits
1317343206 negative hits
10803620 bad hits
0 false hits
500909481 miss
0 too long
968276 pass2 hits
1088078 2passes
17136909 reverse hits
10934 reverse miss
0 access denied
cache hits (91% pos + 6% neg) system 0% per-process
deletions 0%, falsehits 0%, toolong 0%
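
For anyone wanting to line up the vmstat -s page counts with the top summary, the conversion is just pages times the 4096-byte page size. A minimal Python sketch follows; the names PAGE_SIZE, PAGES and pages_to_mib are made up for illustration, and the counts are copied from the vmstat -s output above. With these numbers the free, inactive, wired and swap-in-use figures should come out close to what top printed.

```python
# Rough cross-check of the vmstat -s page counts against top's summary.
# PAGE_SIZE, PAGES and pages_to_mib are illustrative names, not from any tool.

PAGE_SIZE = 4096  # "4096 bytes per page" in the vmstat -s output

# Page counts copied from the vmstat -s output in this message.
PAGES = {
    "free": 58419,
    "active": 3733123,
    "inactive": 1789074,
    "wired": 22427,
    "cached file": 3376469,
    "cached executable": 140490,
    "kernel pool": 2817259,
    "swap total": 10485759,
    "swap in use": 89099,
}


def pages_to_mib(count, page_size=PAGE_SIZE):
    """Convert a page count to mebibytes."""
    return count * page_size / (1024 * 1024)


for name, count in PAGES.items():
    mib = pages_to_mib(count)
    print(f"{name:18s} {count:10d} pages  {mib:10.1f} MiB  {mib / 1024:6.2f} GiB")
```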